A very useful metric for variant QC is the ABHet ratio. Variants with ABHet < 0.2 and > 0.8 are used to filter out false positives. More conservative cutoffs (0.25 and 0.75) are also used.
ABHet is a variant-level annotation that aims to estimate if biallelic variants match expected allelic ratios. An ideal heterozygous variant will have a value of close to 0.5 and an ideal homozygous variant will have a value of close to 1.0. ABHet is calculated for a variant based on all samples in the VCF which are not homozygous at this site.
ABHet=# REF reads from heterozygous samples/# REF + ALT reads from heterozygous samples
What would be the best approach for calculating this with Hail? Could this be added to the variantQC method?
Thank you for your question, @jjfarrell ! II do also use allele balances as a QC metric for my datasets.
I do have a snippet that I use when analysing sequencing data which can be found here.
#create an allele balance annotation
mt= mt.annotate_entries(AB = (mt.AD[1] / hl.sum(mt.AD) ))
#set filter condition for AB. So if you want to be less or more stringent you can pick your balances accordingly. I've included both hom-ref and hom-alt in my filtering step here
filter_condition_ab = ((mt.GT.is_hom_ref() & (mt.AB <= 0.1)) |
(mt.GT.is_het() & (mt.AB >= 0.25) & (mt.AB <= 0.75)) |
(mt.GT.is_hom_var() & (mt.AB >= 0.9)))
fraction_filtered = mt.aggregate_entries(hl.agg.fraction(~filter_condition_ab))
print(f'Filtering {fraction_filtered * 100:.2f}% entries out of downstream analysis.')
mt = mt.filter_entries( (mt.GQ>=20) &
(mt.GQ >= 10) &
((mt.GT.is_hom_ref() & (mt.AB <= 0.1)) |
(mt.GT.is_het() & (mt.AB >= 0.25) & (mt.AB <= 0.75)) |
(mt.GT.is_hom_var() & (mt.AB >= 0.9))))
I hope that this is helpful
And to note, our developers have been flagged to see if we can include this into a future iteration of Hail.
@kumarveerapen I think the solution we want here is a variant-level metric which is the sum of ref reads in het samples divided by the sum of alt reads in het samples.
We probably want this to compute a ratio per alternate allele, actually.
@tpoterba @kumarveerapen
yes, we typically use this at the variant level rather than the GT level as Tim suggests. Two variant level metrics could be defined: An AS_ABHet (Allele Specific ABHet) and an ABHet (defined as ratio of all #Alt_reads to #homozygous reference reads). We run variantQC after splitting the multiallelic so either would work.