Filtering/annotating haplotypes

nlw · July 22, 2020, 7:05pm

It seems quite easy to filter individuals by the presence or absence of a single variant, but I haven’t found a way to do the same for haplotypes.
I tried to create a two-snp haplotype-based annotation on my samples, but I couldn’t find a simple way of doing this with a logic statement.
Here’s what I tried, but it doesn’t seem possible to get to a specific set of entries for an individual.

mt = mt.annotate_cols(hap=hl.case()
                 .when(mt[[locus1,alleles1],mt.s].GT.is_hom_ref & mt[[locus2,alleles2],mt.s].GT.is_hom_var, 'haplotype1')
                 .when(mt[[locus1,alleles1],mt.s].GT.is_hom_ref & mt[[locus2, alleles2],mt.s].GT.is_hom_ref, 'haplotype2')
                 .default('other'))

Is there an intuitive solution to this problem that I’m missing? (There is of course the option to do a series of filters to achieve the same result. Or maybe setting bitwise flags on the entries…)

Thanks team!

nlw · July 22, 2020, 7:29pm

ok, so i just implemented this using a bitwise notation like this:
mt = mt.annotate_entries(hap_bit=hl.case()
.when(((mt.locus==locus1) & (mt.alleles==alleles1) & mt.GT.is_hom_var()), 20)
.when(((mt.locus==locus1) & (mt.alleles==alleles1) & mt.GT.is_hom_ref()), 0)
.when(((mt.locus==locus1) & (mt.alleles==alleles1) & mt.GT.is_het()), 10)
.when(((mt.locus==locus2) & (mt.alleles==alleles2) & mt.GT.is_hom_var()), 2)
.when(((mt.locus==locus2) & (mt.alleles==alleles2) & mt.GT.is_hom_ref()), 0)
.when(((mt.locus==locus2) & (mt.alleles==alleles2) & mt.GT.is_het()), 1)
.default(-1))
mt = mt.filter_entries(mt.hap_bit >= 0)
mt = mt.annotate_cols(hap_score=hl.agg.sum(mt.hap_bit))`

and I think that did it.
so long as the haplotype isn’t big, this isn’t such a bad way of doing it.

tpoterba · July 22, 2020, 7:29pm

Doing this kind of lookup-and-aggregate-at-the-same-time is a somewhat tricky thing to implement in a nice way in a distributed setting, but you can get around that by doing something like the following:


gts_at_loci_of_interest = hl.agg.filter(
    (mt.locus == hl.locus('1', 1000)) | (mt.locus == hl.locus('1', 1005)), 
   hl.agg.collect(mt.GT))

mt = mt.annotate_cols(
    hl.case()
        .when(gts_at_loci_of_interest.all(lambda x: x.is_hom_ref()), 'hap1')
        .when(gts_at_loci_of_interest.all(lambda x: x.is_hom_ref()), 'hap2')
        .default('other')
)

nlw · July 22, 2020, 7:31pm

oh this is good strategy too. thanks!

tpoterba · July 22, 2020, 7:32pm

Your solution above with bitwise sum is also a super clever way to do it – you could do something like annotate each locus with locus_weight = 10 and 20 and then sum hl.agg.sum(mt.GT.n_alt_alleles() * mt.locus_weight) to avoid needing to write so many match statements.

tpoterba · July 22, 2020, 7:42pm

er, would have to be 10 and 30 above – otherwise het/het and homref/homvar would have the same sum.

Topic		Replies	Views
Set sample genotype to missing based on different conditions for SNPs and INDELS Hail Query & hailctl	2	486	February 15, 2019
Filtering By Genotype Hail Query & hailctl	7	755	June 20, 2020
Do nothing if it is homozygous reference 0/0 Hail Query & hailctl	2	485	February 12, 2019
Outputting sample IDs for multi allelic sites Hail Query & hailctl	5	201	April 10, 2024
Filter variants close to indels Hail Query & hailctl	2	313	April 28, 2022

Filtering/annotating haplotypes

Related topics