Thanks so much for supporting such a great tool. I am running into an issue trying to import multiple VCF shards from Sentieon’s implementation of GATK. Some of the shards import without issue. For others, there is an error thrown for single lines (complex indels):
mt = hl.import_vcf(in_path + 'GVCFtyper-shard_2.vcf',
                   reference_genome='GRCh38', array_elements_required=False, skip_invalid_loci=True)
hail.utils.java.FatalError: HailException: GVCFtyper-shard_2.vcf:column 381286: empty integer field
… 0,0,0,0,0,0,0,0,0,0,0,0,0:18:99:.:. :.:.:.:.:. 0/1:8,6,0,0,0,0,0,0,0,0,0 …
offending line: chr1 25819429 . TGAGAGAGA AGAGAGAGA,TGAGAGAGAGAGAGAGAGA,TGAG…
see the Hail log for the full offending line
When I manually remove this line from the VCF, the file can be imported successfully, but this solution is slow and not scalable. Is there a better way to skip such lines and force import? Any insights would be greatly appreciated!
What is the caret pointing to?
It is pointing to the space in the middle of this part of the line “,0:18:99:.:. :.:.:.:.:.”
Sorry, actually pointing to the colon after the space.
This seems like an invalid VCF to me – a FORMAT field can’t start with “:.”. This is feedback you should give to Sentieon for sure! In the meantime, I don’t know of a way to ignore lines with bad data like this. Do you see it frequently?
Ah, thanks so much. I’ll definitely pass this along to Sentieon. This happened twice in the file attempted here, and I have a test running on 100 shards to see how frequently it appears. I’ll keep you posted.
Thanks! Sorry about the inconvenience.
If you have those specific patterns, when variant information does not apply to a specific sample, you can temporarily fix it with sed and regular expressions to substitute 0, since Hail matches the data types declared in the header. Bcftools generates similar outputs. It’s a dirty trick but it does the job. Just be careful not to overwrite your actual data.
Quick update: the Sentieon team reports that this bug was fixed in a more recent version of the software and suggests two workarounds: using sed to replace ‘\t:’ with ‘\t./.:’ (for diploids), or using bcftools view to fill 0s into missing genotypes (along the lines of cdeniz’s solutions). We’re using the former and haven’t tested the latter. Thanks everybody for your time!
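For anyone who would rather do the same substitution in Python than in sed, here is a minimal sketch of the ‘\t:’ to ‘\t./.:’ replacement. The function names fill_empty_genotypes and fix_vcf are hypothetical, and the code assumes an uncompressed, tab-delimited VCF with diploid samples. Unlike a blanket sed over the whole line, it only touches the sample columns (column 10 onward), which is a design choice of this sketch rather than something Sentieon suggested.

```python
def fill_empty_genotypes(line: str, placeholder: str = "./.") -> str:
    """Prefix sample entries that start with ':' (empty GT) with a placeholder.

    Hypothetical helper: mirrors replacing '\t:' with '\t./.:' but is
    restricted to sample columns. Header lines are returned unchanged.
    """
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    # Columns 1-9 are CHROM..FORMAT; sample entries start at index 9.
    fields[9:] = [placeholder + f if f.startswith(":") else f for f in fields[9:]]
    return "\t".join(fields) + newline


def fix_vcf(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path, patching empty genotypes line by line."""
    with open(src_path) as fin, open(dst_path, "w") as fout:
        for line in fin:
            fout.write(fill_empty_genotypes(line))
```

Writing to a new file rather than editing in place keeps the original shard intact, which addresses the earlier warning about not overwriting your actual data. The patched file can then be passed to hl.import_vcf as before.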