Can you support Illumina DRAGEN msVCF?

Hi all.

I’m using DRAGEN to analyze WGS data from about 2,300 probands and their families.
For scalability, we ran the joint analysis with the iterative gVCF genotyper (IGG) pipeline, which produces an msVCF as the joint output.
I tried to reuse Hail scripts written for other pVCFs (for example, GATK output), but the msVCF format differs from what Hail, including VDS, can handle.
The msVCF looks like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1 sample2 sample3 sample4
chr22   10514964        .       A       T       21.67   PASS    AC=7;AN=4148;NS=2317;NS_GT=2074;NS_NOGT=63;NS_NODATA=180;IC=0.28;HWE=0.0051;ExcHet=1;HWEc2=0    GT:GQ:AD:FT:LPL:LAA     0/0:3:1:LowDepth:0:.    0/0:24:24:PASS:0:.      0/0:23:16:PASS:0:.      0/0:23:23:PASS:0:.
chr22   10514994        .       G       A       35.86   PASS    AC=1065;AN=3820;NS=2317;NS_GT=1910;NS_NOGT=236;NS_NODATA=171;IC=0.33;HWE=5.6e-45;ExcHet=1;HWEc2=0       GT:GQ:AD:FT:LPL:LAA     ./.:0:0:LowDepth;LowGQ:0:.      0/0:0:19:LowGQ:0:.      0/1:3:8,3:PASS:35,0,15:1        0/0:23:23:PASS:0:.
chr22   10515037        .       AAAT    A       6.98    PASS    AC=8;AN=3200;NS=2317;NS_GT=1600;NS_NOGT=569;NS_NODATA=148;IC=0.25;HWE=0.0087;ExcHet=1;HWEc2=0   GT:GQ:AD:FT:LPL:LAA     0/0:3:1:LowDepth:0:.    0/0:0:18:LowGQ:0:.      0/0:7:16:PASS:0:.       ./.:0:4:LowGQ:0:.

For now, I am using the custom code below to work with it in Hail:

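import hail as hl

mt = hl.read_matrix_table(f'{i_dir}/')
# Pad AD to [ref, alt] where the msVCF stores only the reference depth.
# missing_false=True lets a missing GT fall through to the later branches.
mt = mt.annotate_entries(DP=hl.sum(mt.AD),
                         AD=hl.case(missing_false=True)
                         .when(mt.GT.is_hom_ref(), hl.array([mt.AD[0], 0]))
                         .when(hl.is_missing(mt.GT) & hl.is_missing(mt.AD), hl.missing(hl.tarray(hl.tint32)))
                         .when(hl.is_missing(mt.GT) & ~hl.is_missing(mt.AD), hl.array([mt.AD[0], 0]))
                         .default(mt.AD))

mt.write(i_dir + 'Inputs/' + project + '_beforeQC_' + date + '.mt', overwrite=True)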

So I hope Hail can add support for DRAGEN msVCF.
Could you please consider this?
I’m attaching a related link for reference.

I would appreciate a favorable consideration.

Thank you.

I have successfully used Hail to analyze an msVCF (generated by the iterative gVCF genotyper) with several thousand samples. What exactly is the problem you are having?

Hi @DBScan

Thanks for your reply!

I am analyzing de novo variants (DNVs) with our msVCF.
To run the de_novo() function I need a complete PL, that is, three elements for hom ref, het, and hom var, because de_novo() uses the parental PLs to calculate the probability that the child’s genotype is a true de novo variant.
But an msVCF genotyped with IGG has only one PL element for entries whose GT is hom ref.
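
To make the problem concrete, here is a rough plain-Python sketch (not Hail) that splits two of the sample columns from the excerpt in my first post and counts their LPL values. The helper names parse_sample and lpl_values are made up for this illustration.

```python
def parse_sample(format_field, sample_field):
    """Split one msVCF sample column into a dict keyed by FORMAT tag."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return dict(zip(keys, values))


def lpl_values(entry):
    """Return LPL as a list of ints, or an empty list if it is missing."""
    raw = entry.get("LPL", ".")
    if raw == ".":
        return []
    return [int(x) for x in raw.split(",")]


fmt = "GT:GQ:AD:FT:LPL:LAA"
# Sample columns copied from the chr22 records shown earlier in the thread.
hom_ref = parse_sample(fmt, "0/0:23:16:PASS:0:.")
het = parse_sample(fmt, "0/1:3:8,3:PASS:35,0,15:1")

print(len(lpl_values(hom_ref)))  # 1: only the hom-ref likelihood is stored
print(len(lpl_values(het)))      # 3: hom ref, het, hom var
```

The het entry carries the three likelihoods de_novo() needs, while the hom-ref entry, which is what the parents of a de novo call usually have, carries only one.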

Did you analyze DNVs with your msVCF?
If so, can you tell me how you handled the missing PL values?


I see what you are trying to achieve now. You would have to rerun the iterative gVCF genotyper with the parameter --gg-remove-nonref false; you then get three values for LPL (local normalized, Phred-scaled genotype likelihoods, as in the original gVCF) for hom-ref entries as well. For example:

chr1    10109   .       AACCCT  A,<NON_REF>     .       PASS    AC=1,0;AN=6;NS=3;NS_GT=3;NS_NOGT=0;NS_NODATA=0;GAC=1,0;GAN=6;GNS=3;GNS_GT=3;GNS_NOGT=0;GNS_NODATA=0     GT:GQ:AD:FT:LPL:LAA     0/1:8:2,1,0:PASS:25,0,5,29,9,37:1,2     0/0:5:101,0,13:PASS:0,5,189:2   0/0:5:84,0,10:PASS:0,5,101:2
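
In the record above, LPL is indexed by each sample's local alleles rather than by the full ALT list, so the values still need to be placed at their site-level positions before they can be used as a PL. Below is a minimal plain-Python sketch of that reindexing. It assumes that LAA holds 1-based indices into ALT and does not list the reference allele, which is how I read the example; the function names are hypothetical, and positions with no local value are left as None.

```python
def gt_index(j, k):
    """Position of the unphased diploid genotype j/k in VCF PL ordering."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def local_to_global_pl(lpl, laa, n_alleles, fill=None):
    """Spread local likelihoods (LPL) over the site-level PL layout.

    lpl       -- likelihoods ordered over the sample's local alleles
    laa       -- 1-based ALT indices local to the sample (assumed: no REF)
    n_alleles -- number of alleles at the site, REF included
    fill      -- value for genotypes the sample has no likelihood for
    """
    local_alleles = [0] + list(laa)  # the reference is always local
    n_local = len(local_alleles)
    if len(lpl) != n_local * (n_local + 1) // 2:
        raise ValueError("LPL length does not match the number of local alleles")
    out = [fill] * (n_alleles * (n_alleles + 1) // 2)
    for b in range(n_local):
        for a in range(b + 1):
            global_pos = gt_index(local_alleles[a], local_alleles[b])
            out[global_pos] = lpl[gt_index(a, b)]
    return out


# Site alleles: REF, A, <NON_REF>  ->  n_alleles = 3
print(local_to_global_pl([25, 0, 5, 29, 9, 37], [1, 2], 3))
# [25, 0, 5, 29, 9, 37]
print(local_to_global_pl([0, 5, 189], [2], 3))
# [0, None, None, 5, None, 189]
```

For the two hom-ref samples the three LPL values describe REF versus <NON_REF>, so the positions for allele A stay empty. Whether the <NON_REF> likelihoods are an acceptable stand-in for a specific ALT allele in de_novo() is a modelling decision this sketch does not make. Hail also provides hl.vds.local_to_global for this kind of reindexing, but as far as I know it expects a local-allele array that includes the reference, so LAA would need a leading 0.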