Hello. What is the proper way to query multiallelic sites in a MatrixTable containing hundreds of samples? I know you can ask for hl.len(mt.alleles) > 3
, but what can you do for multiple samples?
Can you be specific about the query you want to write? You’re correct that:
mt.filter_rows(hl.len(mt.alleles) >= 3)
will return a matrixtable with only multiallelic variants.
Let’s say we have a row, for which we have hl.len(mt.alleleS) == 8
. We also have 3 samples at this position each with the following GT 1/2 0/1 2/3. How can we query what alleles each sample has and what they are?
@matin ,
I directly answer your question below, but I think we can give you a better answer if you tell us what you’re ultimately trying to accomplish. Are you trying to run a GWAS on these alleles? Are you doing a burden test? Something else?
Hail MatrixTables are not indexed by number of alleles, so this pipeline necessary scans across the entire dataset. For large datasets, this will take a long time. If you plan to do a lot of work on this subset of the variants, you might want to filter once and write to a smaller file and use the smaller file for your analysis.
mt = hl.read_matrix_table(...)
mt = mt.filter_rows(hl.len(mt.alleles) >= 3)
mt.show()
Will print the top-left corner of the filtered matrix of genotypes to your Notebook or terminal.
If you want to export those genotypes to a sharded VCF you can replace mt.show()
with:
hl.export_vcf(mt, 'gs://path/to/file.vcf.bgz', parallel=True)
If you want to collect those genotypes as an array of arrays of genotypes, replace mt.show()
with:
mt = mt.annotate_rows(genotypes = hl.agg.collect(mt.GT))
array_of_arrays_of_gts = mt.genotypes.collect()
To elaborate on my previous comment: say we have three samples in our MatrixTable. It also has a row at locus 5:5000 for which we have an alleles
field that contains 8 values ( e.g, ["CTGTGTGTGTGTG","<*>", "C", "CTGTG", "CTGTGTG", "CTGTGTGTG", "CTGTGTGTGTGTGT...
). We have called variants for each sample at this locus. Assume that the GT values are 1/2, 0/1, and 2/3, respectively for each sample. Now I want to query the ref/alt values for say, the first sample. my ultimate goal is to map alt/ref’s at this row to each sample.
Are you asking how to index into the alleles array? Meaning if you have these GTs with your shared alleles field:
1/2 0/1 2/3
You want these values:
("<*>", "C") ("CTGTGTGTGTGTG", "<*>") ("C", "CTGTG")
?
You can index both the alleles array and the call like normal Python arrays:
mt = mt.annotate_entries(
first_allele_sequence = mt.alleles[mt.GT[0]],
second_allele_sequence = mt.alleles[mt.GT[1]]
)