Import GLnexus pVCF

Hi Hail team,

I’m using GLnexus to joint-call my gvcfs, and I would like to import my pVCF into Hail to do further filtering. But some of GT like “./1”, “./0” will show as NA in Hail. Is there any possible way we can config to deal with this problem? Thanks!

Best,
Po-Ying

Hey @poyingfu, thanks for your question.

How would you like to interpret a .? Should ./1 be 0/1 or 1/1?

This has to do with “MONOALLELIC” variants where GLnexus is trying to figure out what do with variants where a sample only has a copy number of 1 (Reading GLnexus pVCFs · dnanexus-rnd/GLnexus Wiki · GitHub).

I was working with the 2019 release of the SPARK data and ran my normal QC pipeline and ended up with a lot fewer indels than I expected. I looked into it and realized that all the genotypes at “MONOALLELIC” sites were being set to missing by Hail, so I had to figure out something to handle it.

I didn’t ultimately use the GLnexus version of the data (we ended up deciding to reprocess with GATK), but I found myself doing things like:

# This function looks for variants with "MONOALLELIC" in the filter field and:
# -If the genotype is missing, it makes it homref or het (homvar calls don't appear as missing)
# -Sets a value for the number of ref reads so that calls won't be filtered out later
def handle_monoallelics(mt):
    mt = mt.annotate_entries(GT = ( hl.case()
                               .when((mt.filters == {"MONOALLELIC"}) & hl.is_missing(mt.GT) & hl.is_missing(mt.AD[0]) & (mt.AD[1] == 0), 
                                    hl.Call([0,0]))
                               .when((mt.filters == {"MONOALLELIC"}) & hl.is_missing(mt.GT) & hl.is_missing(mt.AD[0]) & (mt.AD[1] > 0), 
                                    hl.Call([0,1]))
                               .default(mt.GT) ),
                           
                             AD = ( hl.case()
                               .when((mt.filters == {"MONOALLELIC"}) & hl.is_missing(mt.AD[0]), 
                                    [mt.DP - mt.AD[1], mt.AD[1]] )
                               .default(mt.AD) )
                            )
    return (mt)

This recovered my indels but I realized later it wasn’t doing everything I needed for variant call rate calculations, so I also did:

# This is to fix the no-calls where the RNC is ["-."] (a sample's call opposite its monoallelic variant)
mt = mt.annotate_entries(GT = hl.cond( (mt.RNC.size() == 1) & (mt.RNC[0] == "-.") & (hl.is_missing(mt.GT)) &
                                       (mt.AD[0] > 0) & (mt.AD[1] == 0), 
                                        hl.Call([0,0]), 
                                        mt.GT ),
                         
                         AD = hl.cond( (mt.RNC.size() == 1) & (mt.RNC[0] == "-.") & (hl.is_missing(mt.GT)) &
                                       (mt.AD[0] > 0) & (mt.AD[1] == 0), 
                                       [mt.DP, 0], 
                                        mt.AD )
                        )

So I don’t know if the format of the data you’re working with is exactly the same as this, but it’s example of how I tried to handle the issue you’re talking about.

2 Likes

Thanks Kyle! That’s what I want to do!!! The way you approach it is pretty amazing.

My data look like below:

There are some cases, such as ./., ./0 and ./1, I need to handle.

I want to do gene-base burden test, so I need to calculate hetero-ALT AC and homo-ALT AC for each variants.
I’m really appreciate your help! It’s definitely a way I can try. Thanks a lot!

1 Like