I was wondering whether it is possible to import non-GATK VCFs into Hail. Our VCF files were generated by Illumina Services using the Isaac variant caller, which does not output PL values. We considered changing the Hail code to skip the PL check, but we are not sure whether that would affect downstream analysis. What would you recommend in this case?
Hi Wendy,
It’s entirely possible to use non-GATK VCFs. Much of the infrastructure in the 0.1 version was built for the GT/AD/DP/GQ/PL GATK genotypes, but this is getting relaxed in 0.2. For now, you can either use the default VCF import (which will read whichever of these fields are present, and ignore the rest) or the generic=True import mode, which will read whatever fields are present as a generic struct. This second mode can cause downstream operations to be pretty slow, though.
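For reference, here is a rough sketch of the two import modes side by side. It assumes a Hail 0.1 HailContext is passed in as hc; the helper name import_vcf_both_ways and the file name in the usage comment are made up for illustration.

```python
def import_vcf_both_ways(hc, path):
    """Import the same VCF with the default and the generic importer.

    `hc` is assumed to be a Hail 0.1 HailContext; `path` is a placeholder.
    """
    # Default mode: reads whichever of GT/AD/DP/GQ/PL are present
    # and ignores any other FORMAT fields.
    vds_default = hc.import_vcf(path)

    # Generic mode: keeps whatever FORMAT fields are present as a struct.
    # Downstream operations on this dataset may be noticeably slower.
    vds_generic = hc.import_vcf(path, generic=True)

    return vds_default, vds_generic


# Usage on a machine with Hail 0.1 installed (not executed here):
#   from hail import HailContext
#   hc = HailContext()
#   vds, gds = import_vcf_both_ways(hc, 'isaac_calls.vcf')  # hypothetical file
```

Importing both ways is only useful for comparison; for day-to-day work you would pick one mode.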
If you can provide a bit more information about the kinds of analyses you’re looking to do, I can advise better!
We tried generic=True and it seems to work with some functions but summarize() gives us errors.
We would like to be able to do quite a lot of things with it, such as generating PCA plots, finding singletons, running association studies, etc.
vds.summarize()
Traceback (most recent call last):
File “”, line 1, in
File “”, line 2, in summarize
File “/hail/java.py”, line 121, in handle_py4j
hail.java.FatalError: HailException: genotype signature Genotype' required, found:Struct{
GT: Call,
AD: Array[Int32],
DP: Int32,
GQ: Int32,
DPF: Int32,
PF: Array[Int32]
}’
If you don’t need to use DPF and PF urgently, then using the non-generic (default) import will work. PLs aren’t used for many methods in Hail at the moment.
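To see in advance which FORMAT fields the default import would keep and which it would ignore, something like the following plain-Python sketch may help. It does not use Hail at all; it only scans the ##FORMAT lines of a VCF header. The function names and the example header are illustrative, with field IDs taken from the error message above and placeholder descriptions.

```python
import re

# The genotype fields the default (non-generic) import knows about.
GATK_FIELDS = {"GT", "AD", "DP", "GQ", "PL"}


def format_ids(header_lines):
    """Return the IDs declared in ##FORMAT header lines, in file order."""
    ids = []
    for line in header_lines:
        match = re.match(r"##FORMAT=<ID=([^,>]+)", line)
        if match:
            ids.append(match.group(1))
    return ids


def split_fields(header_lines):
    """Split declared FORMAT IDs into (kept, ignored) for the default import."""
    ids = format_ids(header_lines)
    kept = [i for i in ids if i in GATK_FIELDS]
    ignored = [i for i in ids if i not in GATK_FIELDS]
    return kept, ignored


# Illustrative header only; the descriptions are placeholders.
example_header = [
    "##fileformat=VCFv4.1",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=AD,Number=.,Type=Integer,Description="placeholder">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="placeholder">',
    '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="placeholder">',
    '##FORMAT=<ID=DPF,Number=1,Type=Integer,Description="placeholder">',
    '##FORMAT=<ID=PF,Number=.,Type=Integer,Description="placeholder">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
]

kept, ignored = split_fields(example_header)
print("kept:", kept)
print("ignored:", ignored)
```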
Thanks! My colleague worked out that PL has to be declared in the header for the default import to work, and now most of the methods seem to be working. We can start experimenting with a few things now.
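In case it helps anyone else, below is a minimal sketch of one way to add a PL declaration to a VCF header when it is missing. This is an assumption about what "PL in the header" amounts to, not necessarily what was done here. The function name is made up, the description text is a placeholder, and Number=G with Type=Integer follows the usual VCF definition of PL. The records themselves are not changed.

```python
# Header line declaring PL; the Description text is a placeholder.
PL_LINE = (
    '##FORMAT=<ID=PL,Number=G,Type=Integer,'
    'Description="Phred-scaled genotype likelihoods">'
)


def ensure_pl_declared(header_lines):
    """Return a copy of the header with a PL FORMAT line added if absent.

    The new line is placed just before the #CHROM column line, or at the
    end if no #CHROM line is found.
    """
    if any(line.startswith("##FORMAT=<ID=PL,") for line in header_lines):
        return list(header_lines)  # already declared, nothing to do

    patched = []
    inserted = False
    for line in header_lines:
        if line.startswith("#CHROM") and not inserted:
            patched.append(PL_LINE)
            inserted = True
        patched.append(line)
    if not inserted:
        patched.append(PL_LINE)
    return patched
```

The patched header would then be written back in front of the unchanged records before importing.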