I’m using GLnexus to joint-call my gvcfs, and I would like to import my pVCF into Hail to do further filtering. But some of GT like “./1”, “./0” will show as NA in Hail. Is there any possible way we can config to deal with this problem? Thanks!
I was working with the 2019 release of the SPARK data and ran my normal QC pipeline and ended up with a lot fewer indels than I expected. I looked into it and realized that all the genotypes at “MONOALLELIC” sites were being set to missing by Hail, so I had to figure out something to handle it.
I didn’t ultimately use the GLnexus version of the data (we ended up deciding to reprocess with GATK), but I found myself doing things like:
# This function looks for variants with "MONOALLELIC" in the filter field and:
# -If the genotype is missing, it makes it homref or het (homvar calls don't appear as missing)
# -Sets a value for the number of ref reads so that calls won't be filtered out later
def handle_monoallelics(mt):
mt = mt.annotate_entries(GT = ( hl.case()
.when((mt.filters == {"MONOALLELIC"}) & hl.is_missing(mt.GT) & hl.is_missing(mt.AD[0]) & (mt.AD[1] == 0),
hl.Call([0,0]))
.when((mt.filters == {"MONOALLELIC"}) & hl.is_missing(mt.GT) & hl.is_missing(mt.AD[0]) & (mt.AD[1] > 0),
hl.Call([0,1]))
.default(mt.GT) ),
AD = ( hl.case()
.when((mt.filters == {"MONOALLELIC"}) & hl.is_missing(mt.AD[0]),
[mt.DP - mt.AD[1], mt.AD[1]] )
.default(mt.AD) )
)
return (mt)
This recovered my indels but I realized later it wasn’t doing everything I needed for variant call rate calculations, so I also did:
# This is to fix the no-calls where the RNC is ["-."] (a sample's call opposite its monoallelic variant)
mt = mt.annotate_entries(GT = hl.cond( (mt.RNC.size() == 1) & (mt.RNC[0] == "-.") & (hl.is_missing(mt.GT)) &
(mt.AD[0] > 0) & (mt.AD[1] == 0),
hl.Call([0,0]),
mt.GT ),
AD = hl.cond( (mt.RNC.size() == 1) & (mt.RNC[0] == "-.") & (hl.is_missing(mt.GT)) &
(mt.AD[0] > 0) & (mt.AD[1] == 0),
[mt.DP, 0],
mt.AD )
)
So I don’t know if the format of the data you’re working with is exactly the same as this, but it’s example of how I tried to handle the issue you’re talking about.
There are some cases, such as ./., ./0 and ./1, I need to handle.
I want to do gene-base burden test, so I need to calculate hetero-ALT AC and homo-ALT AC for each variants.
I’m really appreciate your help! It’s definitely a way I can try. Thanks a lot!