Ok, I see where this could be coming from. We are running this as part of a test with Qubole (Zeppelin-based notebook) so I'll try to figure out how to get the logs.
In the meantime, here are some details on our FORMAT tags. We used agg to combine single-sample gVCFs (generated with the Illumina Isaac pipeline) into a multi-sample VCF, and since it's not a GATK VCF, some of the typical tags are missing. Here's an example:
GT:GQ:DP:DPF:AD:PF 0/0:60:22:.:.,.:. 0/0:100:20:.:.:.
Could any of these be an issue?
- PL is missing
- AD for hom ref calls is "." in some cases but ".,." in others
- Is it expected that, even for hom ref calls, AD includes a value for every allele in the variant? I need to check whether these are lost during aggregation with agg
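To make the points above concrete, here is a rough sketch in plain Python of how I'm checking the example line for a missing PL tag and for inconsistent AD arity. The function names (`parse_sample`, `check_genotype_fields`) are my own, made up for illustration, and this is not how Hail's importer works internally; it only inspects the FORMAT and sample columns as text.

```python
def parse_sample(format_str, sample_str):
    """Map FORMAT keys to one sample's values (hypothetical helper)."""
    keys = format_str.split(":")
    values = sample_str.split(":")
    # The VCF spec allows trailing fields to be dropped, so pad with ".".
    values += ["."] * (len(keys) - len(values))
    return dict(zip(keys, values))


def check_genotype_fields(format_str, sample_strs):
    """Report whether PL is declared and how many AD entries each sample has."""
    keys = format_str.split(":")
    report = {"has_PL": "PL" in keys, "ad_shapes": []}
    for sample_str in sample_strs:
        fields = parse_sample(format_str, sample_str)
        ad = fields.get("AD", ".")
        # "." gives 1 entry, ".,." gives 2 entries.
        report["ad_shapes"].append(len(ad.split(",")))
    return report


# The example FORMAT column and two sample columns from above.
line = "GT:GQ:DP:DPF:AD:PF 0/0:60:22:.:.,.:. 0/0:100:20:.:.:."
fmt, *samples = line.split()
report = check_genotype_fields(fmt, samples)
print(report)
```

On the example, this reports that PL is absent and that the two samples carry AD with different numbers of entries, which are the two things I suspect might matter.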
Somewhat related, is there a way to load multiple VCFs without having to merge them into a multi-sample VCF? As the spec says, when I provide multiple VCFs to import_vcf, all VCFs are expected to come from the same set of samples. We have one WGS VCF per sample, though, and wanted to see if there is a way to avoid merging them beforehand.
I've also tested ADAM a bit. I could merge individual VCFs into a single RDD/data frame after conversion to Parquet, but subsequent operations seem to be slower than in Hail.