Efficient way to filter a large number of samples by an entry value

If I have a MatrixTable (MT) with row keys [‘locus’, ‘alleles’] and column key ‘s’, what is the best way to filter this MT by a certain entry value? For example, if I wanted to identify all samples with a certain genotype, stored in the entry field ‘GT’, how could I do this efficiently for either one specific locus or possibly many different loci in the row index?

Can you be a little more specific? Perhaps an example and what you’d want as output?

Does this mean identify all samples per row (variant), or overall?

Yep I mean per row.
If the table has keys:

Column key: ['s']
Row key: ['locus', 'alleles']

And a certain row has the key:
|locus<GRCh38>|array<str>|
|chr7:5991910|[G,A,C]|

What would be the best way to identify the samples with genotype 0/1, for example, in that row, when GT is an entry field?

By “identify”, do you mean collect the list of sample IDs per variant that are heterozygous? This would work:


import hail as hl

# For each row (variant), collect the IDs of samples whose GT is heterozygous.
mt = mt.annotate_rows(
    het_sample_ids=hl.agg.filter(mt.GT.is_het(), hl.agg.collect(mt.s))
)