Hi Hail team,
My team is trying to use hl.de_novo() to call de novo variants from our trio-based cohort.
However, the number of calls is much lower than expected (we expected 60~120 de novo variants per family). We then visualized all de novo calls in IGV and could not find any true variants.
During visualization, we also found that all the variants returned by hl.de_novo() are located in difficult regions, such as long poly-A runs and/or repeat regions.
So we traced back to the joint-called VCF. We compared the joint-called VCF from GATK GenotypeGVCFs with the one from Hail's VDS new_combiner(), and we suspect the problem is caused by the PL values.
This is the joint-VCF we got from hl.vds.new_combiner():
And this is the same position in the joint-genotyped VCF from GATK GenotypeGVCFs:
As we can see, the joint VCF from GATK carries over all PL values from the GVCFs, but Hail’s new_combiner() leaves PL missing for 0/0 samples.
I think this might cause a problem when calling de novo variants, because when the father and mother are 0/0 or 0|0, their PL values are missing in the Hail joint-called VCF.
The variant is then not output as a de novo call because of the “kid/dad/mom_linear_pl” part.
Here is the hl.de_novo() code that uses the PL values of the father, mother, and proband:
kid = tm.proband_entry
dad = tm.father_entry
mom = tm.mother_entry
kid_linear_pl = 10 ** (-kid.PL / 10)
kid_pp = hl.bind(lambda x: x / hl.sum(x), kid_linear_pl)
dad_linear_pl = 10 ** (-dad.PL / 10)
dad_pp = hl.bind(lambda x: x / hl.sum(x), dad_linear_pl)
mom_linear_pl = 10 ** (-mom.PL / 10)
mom_pp = hl.bind(lambda x: x / hl.sum(x), mom_linear_pl)
kid_ad_ratio = kid.AD[1] / hl.sum(kid.AD)
dp_ratio = kid.DP / (dad.DP + mom.DP)
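To illustrate our concern, here is a minimal pure-Python sketch of the same PL-to-posterior normalization. It is not Hail code: the helper name pl_to_posterior and the example PL values are made up for illustration, and returning None for a missing PL is our assumption about how a missing value propagates through the expressions above.

```python
from typing import List, Optional, Sequence


def pl_to_posterior(pl: Optional[Sequence[int]]) -> Optional[List[float]]:
    """Convert Phred-scaled likelihoods (PL) to normalized probabilities.

    Returns None when PL is missing, which is how we assume a missing
    value propagates through the arithmetic (hypothetical helper).
    """
    if pl is None:
        return None
    # Same transform as the excerpt: 10 ** (-PL / 10), then normalize.
    linear = [10 ** (-x / 10) for x in pl]
    total = sum(linear)
    return [x / total for x in linear]


# Made-up example values: a heterozygous proband and a parent with missing PL.
kid_pp = pl_to_posterior([40, 0, 60])
dad_pp = pl_to_posterior(None)

print(kid_pp)  # three probabilities that sum to 1, dominated by the het genotype
print(dad_pp)  # None: nothing downstream can be computed for this trio
```

If a parent's PL is missing, every quantity derived from dad_pp or mom_pp is missing as well, which would explain why such sites never pass the de novo filters.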
So I’m wondering why Hail treats PL for 0/0 genotypes (or, to be more specific, for the alleles [“ref”,“<NON_REF>”] from the GVCF) as missing?
Thanks!
Po-Ying