Select certain samples from MatrixTable

Hello everyone

I am working with the standard Hail training dataset:

hl.import_vcf('data_Haill/1kg.vcf.bgz').write('Haill_mt/', overwrite=True)
mt = hl.read_matrix_table('Haill_mt/')

with annotations:

table = hl.import_table('data_Haill//1kg_annotations.txt', impute=True).key_by('Sample')
mt = mt.annotate_cols(pheno = table[mt.s])

Finally, I get the following MatrixTable:

        struct {
        s: str, 
        pheno: struct {
            Population: str, 
            SuperPopulation: str, 
            isFemale: bool, 
            PurpleHair: bool, 
            CaffeineConsumption: int32
    <hail.matrixtable.MatrixTable object at 0x7fd9b55dcb50>

The SuperPopulation field has the following counts:

frozendict({'AFR': 1018, 'AMR': 535, 'EAS': 617, 'EUR': 669, 'SAS': 661})

Could you please help me?
I need to work only with the ‘AFR’ and ‘AMR’ samples and drop the rest.
Could you give me a code example of how to do that?

Take a look at MatrixTable.filter_cols:

mt = mt.filter_cols(
    (mt.pheno.SuperPopulation == 'AFR') | (mt.pheno.SuperPopulation == 'AMR')
)

Hi @danking

Thanks a lot for your help. Is there a tutorial or documentation for Hail that covers simple operations? I mean operations like those on a DataFrame in Python, for example df.loc: something for handling the data, adding or removing samples, filtering variants?
More examples like the one you just posted would also be useful.
Thanks a lot in advance.

Yes, see the Hail GWAS Tutorial.

Also see the cheat sheets.

You might also check out the Hail How-To Guides

Hi @danking

Could you please explain one more thing?
Now I would like to keep only certain samples and filter out the rest.
I have tried several approaches, but none of them worked.
For example:

mt.filter_cols( mt.key_cols_by().cols("HG00096","HG00097","HG00099","HG00100") )
TypeError: MatrixTable.cols() takes 1 positional argument but 5 were given

mt.filter_cols( mt.s == "HG00096","HG00097","HG00099","HG00100")
TypeError: MatrixTable.filter_cols() takes from 2 to 3 positional arguments but 5 were given

Could you please help me with this?

Hail filtering expressions are just Python expressions. Something like

abc == "pie","apple"

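does not compare abc against both values. Python treats the comma as a separator, so inside a function call "apple" becomes a separate argument. In Hail, try using an & to join multiple conditions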

(mt.s == "abc") & (mt.s == "def")
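
On the comma point above, here is a small plain-Python sketch (no Hail needed) of how Python reads the comma form. FakeTable is a hypothetical stand-in that only copies the call shape of filter_cols (one expression plus an optional keep flag); it is not Hail code.

```python
class FakeTable:
    """Hypothetical stand-in mimicking the call shape of MatrixTable.filter_cols."""

    def filter_cols(self, expr, keep=True):
        return expr


s = "HG00096"

# A comma after a comparison builds a tuple; only the first element is a comparison.
as_tuple = (s == "HG00096", "HG00097")

# Inside a call, each comma starts a new positional argument,
# so the method receives far more arguments than it accepts.
try:
    FakeTable().filter_cols(s == "HG00096", "HG00097", "HG00099", "HG00100")
    error = None
except TypeError as exc:
    error = str(exc)

print(as_tuple)
print(error)
```

The second print should show the same kind of "positional arguments" TypeError as in the question, because only the first argument is a comparison and the remaining strings are passed as extra arguments.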

So, to keep only 4 samples I should do the following?
mt.filter_cols( (mt.s == "HG00096")&(mt.s =="HG00097")&(mt.s =="HG00099")&(mt.s =="HG00100"))

Sorry, you’ll want | instead of & to mean “this or that”.
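
For a longer list of sample IDs, chaining the comparisons by hand gets tedious. Below is a minimal sketch of one way to build the same | chain from a list. any_equal is a hypothetical helper name (not part of Hail); it only relies on the expression overloading == and |, which Hail expressions such as mt.s do. The plain-string check at the bottom just illustrates the logic without needing Hail.

```python
from functools import reduce
import operator


def any_equal(expr, values):
    """Return (expr == v1) | (expr == v2) | ... for every v in values.

    Hypothetical helper, not a Hail function. It works with any object whose
    == and | operators are overloaded, and with plain Python values.
    """
    values = list(values)
    if not values:
        raise ValueError("values must contain at least one item")
    return reduce(operator.or_, (expr == v for v in values))


samples_to_keep = ["HG00096", "HG00097", "HG00099", "HG00100"]

# Plain-Python check of the logic: "or" keeps a sample if it matches any ID.
candidates = ["HG00096", "HG00098", "HG00100"]
kept = [s for s in candidates if any_equal(s, samples_to_keep)]
print(kept)

# With Hail, assuming `mt` is the MatrixTable from this thread:
#   mt = mt.filter_cols(any_equal(mt.s, samples_to_keep))
# The Hail how-to guides also describe a set-based form:
#   mt = mt.filter_cols(hl.literal(set(samples_to_keep)).contains(mt.s))
```

With & in place of |, no sample could pass the filter, since a single ID cannot equal four different values at once.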