Filtering/annotating haplotypes

It seems quite easy to filter individuals by the presence or absence of a single variant, but I haven’t found a way to do the same for haplotypes.
I tried to create a two-snp haplotype-based annotation on my samples, but I couldn’t find a simple way of doing this with a logic statement.
Here’s what I tried, but it doesn’t seem possible to get to a specific set of entries for an individual.

mt = mt.annotate_cols(
                 .when(mt[[locus1,alleles1],mt.s].GT.is_hom_ref & mt[[locus2,alleles2],mt.s].GT.is_hom_var, 'haplotype1')
                 .when(mt[[locus1,alleles1],mt.s].GT.is_hom_ref & mt[[locus2, alleles2],mt.s].GT.is_hom_ref, 'haplotype2')

Is there an intuitive solution to this problem that I’m missing? (There is of course the option to do a series of filters to achieve the same result. Or maybe setting bitwise flags on the entries…)

Thanks team!

ok, so i just implemented this using a bitwise notation like this:
mt = mt.annotate_entries(
.when((( & (mt.alleles==alleles1) & mt.GT.is_hom_var()), 20)
.when((( & (mt.alleles==alleles1) & mt.GT.is_hom_ref()), 0)
.when((( & (mt.alleles==alleles1) & mt.GT.is_het()), 10)
.when((( & (mt.alleles==alleles2) & mt.GT.is_hom_var()), 2)
.when((( & (mt.alleles==alleles2) & mt.GT.is_hom_ref()), 0)
.when((( & (mt.alleles==alleles2) & mt.GT.is_het()), 1)
mt = mt.filter_entries(mt.hap_bit >= 0)
mt = mt.annotate_cols(hap_score=hl.agg.sum(mt.hap_bit))`

and I think that did it.
so long as the haplotype isn’t big, this isn’t such a bad way of doing it.

Doing this kind of lookup-and-aggregate-at-the-same-time is a somewhat tricky thing to implement in a nice way in a distributed setting, but you can get around that by doing something like the following:

gts_at_loci_of_interest = hl.agg.filter(
    ( =='1', 1000)) | ( =='1', 1005)), 

mt = mt.annotate_cols(
        .when(gts_at_loci_of_interest.all(lambda x: x.is_hom_ref()), 'hap1')
        .when(gts_at_loci_of_interest.all(lambda x: x.is_hom_ref()), 'hap2')

oh this is good strategy too. thanks!

Your solution above with bitwise sum is also a super clever way to do it – you could do something like annotate each locus with locus_weight = 10 and 20 and then sum hl.agg.sum(mt.GT.n_alt_alleles() * mt.locus_weight) to avoid needing to write so many match statements.

1 Like

er, would have to be 10 and 30 above – otherwise het/het and homref/homvar would have the same sum.