If I have an MT with row key ['locus', 'alleles'] and column key 's', what is the best way to filter it by a certain entry value? For example, if I wanted to identify all samples with a certain genotype, given by the entry field 'GT', how could I do this efficiently for either one specific locus or possibly many different loci in the row index?
Can you be a little more specific? Perhaps an example and what you’d want as output?
Does this mean identify all samples per row (variant), or overall?
Yep I mean per row.
If the table has keys:
Column key: ['s']
Row key: ['locus', 'alleles']
And a certain row has this key:
|locus<GRCh38>|array<str>|
|chr7:5991910|[G,A,C]|
What would be the best way to identify the samples with, for example, genotype 0/1 in that row, given that GT is an entry field?
By “identify”, do you mean collect the list of sample IDs per variant that are heterozygous? This would work:
mt = mt.annotate_rows(
    # For each row (variant), collect the IDs of the samples whose GT is heterozygous.
    het_sample_ids=hl.agg.filter(mt.GT.is_het(), hl.agg.collect(mt.s))
)