Multiallelic in multisample MatrixTable

Hello. What is the proper way to query multiallelic sites in a MatrixTable containing hundreds of samples? I know you can ask for hl.len(mt.alleles) > 3 , but what can you do for multiple samples?

Can you be specific about the query you want to write? You’re correct that:

mt.filter_rows(hl.len(mt.alleles) >= 3)

will return a matrixtable with only multiallelic variants.

Let’s say we have a row, for which we have hl.len(mt.alleleS) == 8. We also have 3 samples at this position each with the following GT 1/2 0/1 2/3. How can we query what alleles each sample has and what they are?

@matin ,

I directly answer your question below, but I think we can give you a better answer if you tell us what you’re ultimately trying to accomplish. Are you trying to run a GWAS on these alleles? Are you doing a burden test? Something else?

Hail MatrixTables are not indexed by number of alleles, so this pipeline necessary scans across the entire dataset. For large datasets, this will take a long time. If you plan to do a lot of work on this subset of the variants, you might want to filter once and write to a smaller file and use the smaller file for your analysis.


mt = hl.read_matrix_table(...)
mt = mt.filter_rows(hl.len(mt.alleles) >= 3)
mt.show()

Will print the top-left corner of the filtered matrix of genotypes to your Notebook or terminal.

If you want to export those genotypes to a sharded VCF you can replace mt.show() with:

hl.export_vcf(mt, 'gs://path/to/file.vcf.bgz', parallel=True)

If you want to collect those genotypes as an array of arrays of genotypes, replace mt.show() with:

mt = mt.annotate_rows(genotypes = hl.agg.collect(mt.GT))
array_of_arrays_of_gts = mt.genotypes.collect()

To elaborate on my previous comment: say we have three samples in our MatrixTable. It also has a row at locus 5:5000 for which we have an alleles field that contains 8 values ( e.g, ["CTGTGTGTGTGTG","<*>", "C", "CTGTG", "CTGTGTG", "CTGTGTGTG", "CTGTGTGTGTGTGT... ). We have called variants for each sample at this locus. Assume that the GT values are 1/2, 0/1, and 2/3, respectively for each sample. Now I want to query the ref/alt values for say, the first sample. my ultimate goal is to map alt/ref’s at this row to each sample.

Are you asking how to index into the alleles array? Meaning if you have these GTs with your shared alleles field:

1/2 0/1 2/3

You want these values:

("<*>", "C") ("CTGTGTGTGTGTG", "<*>") ("C", "CTGTG")

?

You can index both the alleles array and the call like normal Python arrays:

mt = mt.annotate_entries(
    first_allele_sequence = mt.alleles[mt.GT[0]],
    second_allele_sequence = mt.alleles[mt.GT[1]]
)