Efficient way to filter on large number of samples by an entry value

jimmy_1 · May 9, 2021, 5:44pm

If I have an MT indexed by row keys [‘locus’, ‘alleles’] and column key ‘s’, what is the best way to filter this MT by a certain entry value? For example, if I wanted to identify all samples with a certain genotype, denoted by the entry field ‘GT’, how could I do this efficiently for either one specific locus or possibly many different loci in the row index?

tpoterba · May 10, 2021, 12:11pm

Can you be a little more specific? Perhaps an example and what you’d want as output?

Does this mean identify all samples per row (variant), or overall?

jimmy_1 · May 10, 2021, 2:34pm

Yep, I mean per row.
If the table has keys:

Column key: ['s']
Row key: ['locus', 'alleles']

And a certain row is indexed by:
|locus<GRCh38>|array<str>|
|chr7:5991910|[G,A,C]|

What would be the best way to identify the samples with genotype 0/1, for example, in that row, when GT is an entry field?

tpoterba · May 10, 2021, 2:38pm

By “identify”, do you mean collect the list of sample IDs per variant that are heterozygous? This would work:

# For each row (variant), collect the IDs of the samples whose GT call is heterozygous.
mt = mt.annotate_rows(
    het_sample_ids=hl.agg.filter(mt.GT.is_het(), hl.agg.collect(mt.s))
)
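
To make the semantics of that annotation concrete, here is a minimal plain-Python sketch of what a per-row "filter then collect" aggregation computes. It does not use Hail at all: the `entries` dictionary, the sample IDs, the second locus, and the helper names `is_het` and `samples_matching` are all hypothetical, made up for illustration. One thing it highlights is that a heterozygous test matches any call with two different alleles, so at a multi-allelic site like [G,A,C] it would also pick up 0/2 and 1/2, not only 0/1. If exactly 0/1 is wanted, a stricter predicate is needed; in Hail something like comparing `mt.GT` against `hl.call(0, 1)` should express that, and restricting rows first (for example with `hl.filter_intervals`) is probably the natural way to limit the work to one or a few loci, though neither was tested in this thread.

```python
# Plain-Python model of a per-row "filter then collect" aggregation.
# All data and names below are hypothetical; this is not Hail code.

def is_het(call):
    """True if the call has two different alleles; missing calls never match."""
    return call is not None and call[0] != call[1]


def samples_matching(entries, predicate, loci=None):
    """For each row, collect the sample IDs whose call satisfies `predicate`.

    entries: {(locus, alleles): {sample_id: call}}, call is a pair of allele
             indices such as (0, 1), or None when the genotype is missing.
    loci:    optional set of locus strings; when given, other rows are skipped.
    """
    result = {}
    for (locus, alleles), calls in entries.items():
        if loci is not None and locus not in loci:
            continue  # row filter: only look at the requested loci
        result[(locus, alleles)] = [s for s, c in calls.items() if predicate(c)]
    return result


entries = {
    ("chr7:5991910", ("G", "A", "C")): {
        "s1": (0, 1), "s2": (1, 2), "s3": (0, 0), "s4": None, "s5": (0, 2),
    },
    ("chr7:5991999", ("T", "C")): {
        "s1": (0, 0), "s2": (0, 1), "s3": (1, 1), "s4": (0, 1), "s5": None,
    },
}

wanted = {"chr7:5991910"}

# Any heterozygous call, which includes 0/2 and 1/2 at this tri-allelic site.
any_het = samples_matching(entries, is_het, loci=wanted)

# Only the exact genotype 0/1.
exact_01 = samples_matching(entries, lambda c: c == (0, 1), loci=wanted)

print(any_het)
print(exact_01)
```

In this toy data the heterozygous predicate returns s1, s2 and s5 for chr7:5991910, while the exact 0/1 predicate returns only s1, which is the distinction worth checking before relying on the collected list.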
