Annotate rows based on GT

I would like to annotate rows in my matrixtable based on their genotype (GT). For instance, I would like to have the AD_mean for hom_var and the AD_mean for het.
AD is an entry with an array of size 2, for instance [20,20].
Currently I have the following:

mt = mt.annotate_rows(AD_mean_homvar =  hl.agg.filter(mt.GT.is_hom_var(), hl.agg.mean(mt.AD)))

This doesn’t work because AD is an array and not a float.
Changing hl.agg.mean(mt.AD) to hl.mean(mt.AD) also doesn’t work.

what is the definition of “AD_mean for hom_var”? How should AD arrays of [20,20] and [15,10] be aggregated across two samples to a single mean? Do you want the mean of [20, 20, 15, 10] or the mean of [20, 10] or something else?

I guess I figured it out:

I would like to get the total AD for every single sample which I do like this (so a single value per sample):

mt = annotate_entries(AD_tot = hl.sum(mt.AD)

For instance Sample1 [10,10] would be 20, Sample2[0,20] would be 20 as well.

Currently I just do the following:

AD_mean_homalt = hl.agg.filter(mt.GT.is_hom_var(),hl.agg.mean(mt.AD_tot))
1 Like

Isn’t mt.DP == hl.sum(mt.AD) by definition of the VCF spec?

What are you trying to do scientifically? Do you want to know the mean number of reads supporting the reference allele and the mean number of reads supporting the alternate allele, at each site? That would be this:

mt = mt.annotate_rows(
    mean_ADs = hl.agg.array_agg(hl.agg.mean, mt.AD)
)

This produces a new annotation, meanADs, which is an array. meanADs[0] is the mean of mt.AD[0] across every sample at this site.

If all your variants are biallelic you could also write this:

mt = mt.annotate_rows(
    mean_ref_AD = hl.agg.mean(mt.AD[0]),
    mean_alt_AD = hl.agg.mean(mt.AD[1])
)

Usually mt.DP == hl.sum(mt.AD) is not the same, since DP includes both supportive and non-supportive reads.

1 Like