Writing VDS to Parquet produces NULL values for GT


#1

Hi, we are trying to read a non-GATK merged VCF (generated with Illumina Isaac pipeline) into Hail 0.1 and after having worked around the initial glitches by adding a definition for PL in the VCF header, we are finally able to run a few tests. However, when I try to import the VCF and write out resulting VDS as Parquet, I see NULL values for GT in the Impala table for loci that have valid calls in the VCF:

>vds_generic=hc.import_vcf("test.vcf.bgz",generic=True)
>print(vds_genric.genotype_schema)
Struct{GT:Call,DP:Int,DPF:Int,AD:Array[Int],GQ:Int,PF:Array[Int],PL:Array[Int]}

>vds_generic.write("sample.vds", parquet_genotypes=True)
>select count(distinct variant.start) from variants;
+-------------------------------+
| count(distinct variant.start) |
+-------------------------------+
| 2255                          |
+-------------------------------+

> select gt, count(*) from variants.gs group by gt;
+------+----------+
| gt   | count(*) |
+------+----------+
| NULL | 11575639 |
| 1    | 2637     |
+------+----------+

Any pointers on what I should try next would be very helpful.

Thanks,
Prachi


#2

Hi Prachi,
Hail makes certain assumptions about the relationships between the fields in the genotype. Unfortunately this isn’t well documented, and the computational model of Spark makes it very hard to automatically warn about them.

Could you try a few things?

>>> print(vds_generic.query_genotypes('gs.map(g => g.GT).counter()'))
>>> vds_generic=hc.import_vcf("test.vcf.bgz",generic=True)
>>> vds_generic.count()
>>> hc.report()

#3

Right after I posted this, I came across this post on converting Call to Genotype and that did the trick! I can now use the methods I was interested in.

>vds_result = vds_generic.annotate_genotypes_expr("g = g.GT.toGenotype()")


#4

you’ll have the methods you want, but you’ve only got the call. PL / DP / everything else are gone


#5

I think you can generate a genotype in expr with those fields - search “Genotype” here.

>>> vds_result = vds_generic.annotate_genotypes_expr("g = Genotype(v, g.GT, g.AD, g.DP, g.GQ, g.PL)")

#6

this is probably exactly like using the non-generic import.


#7

Thanks for the suggestions. Our VCF doesn’t have PL for the calls so some of the methods don’t work and GT is NULL in the Impala table if I import without generic=True.

I’m now trying out your suggestion on generating a genotype with the fields we have. PL will be NULL anyways but at least I can keep GQ, DP and AD. Thanks