Importing VCFs where GT is 0/2 but AD has only 2 entries

I’m getting an error after hc.import_vcf because of the check on whether the number of entries in AD matches the number of alleles implied by GT.

I’m just using vcf-merge to combine raw GATK calls, so I assumed the output complies with the VCF 4.2 format. Is there a way to work around this without simply dropping all of those records from the VCF file or patching Hail to ignore that check?

Thanks

Hi Will,

I strongly recommend you do not ignore the warnings about invalid genotype fields as they are there to ensure your data is not corrupted!

If you want to bypass the checks Hail places on FORMAT fields, you can call import_vcf with generic=True and use call_fields to specify any FORMAT fields you want treated as genotype calls. The GT field is automatically imported as a genotype call.
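
A call along those lines might look like the sketch below. It assumes an existing HailContext passed in as hc; the helper name import_generic_vcf and its extra_call_fields parameter are hypothetical and only wrap the import_vcf arguments mentioned above.

```python
def import_generic_vcf(hc, path, extra_call_fields=()):
    """Hypothetical helper: import a VCF using the generic genotype schema.

    hc is assumed to be an existing HailContext; path is a placeholder
    for your merged VCF.
    """
    # generic=True bypasses the FORMAT-field checks described above.
    # GT is imported as a call automatically, so call_fields only needs
    # to list any additional FORMAT fields to treat as genotype calls.
    return hc.import_vcf(path, generic=True,
                         call_fields=list(extra_call_fields))
```

With a real context this would be used as vds = import_generic_vcf(hc, 'merged.vcf'), where 'merged.vcf' stands in for your own file.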

See this discuss post for more information on working with generic genotype fields.

Best,
Jackie

In this case you may want to use the skip_bad_ad=True option on import_vcf - this will set all offending AD fields to missing. GT=0/2 with 2 entries for AD is indeed in violation of the 4.2 spec, though, I believe! Use Jackie’s advice if you want to keep them defined; it’ll then be possible to use the expression language and annotate_genotypes_expr to get back to a real genotype.
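
If it helps to see which genotypes are affected before importing, here is a rough pure-Python sketch of the idea. It is not Hail's implementation: the function names are made up for illustration, and the check is deliberately loose, flagging only the case where GT refers to an allele index that AD has no entry for (the real check compares AD against the number of alleles at the site).

```python
def ad_covers_gt(gt, ad):
    """Return True if AD has an entry for every allele index used in GT.

    gt and ad are the raw FORMAT strings, e.g. "0/2" and "10,5".
    Missing values (".") are treated as consistent.
    """
    if ad == ".":
        return True
    # Accept both unphased ("/") and phased ("|") genotypes.
    alleles = [int(a) for a in gt.replace("|", "/").split("/") if a != "."]
    if not alleles:
        return True
    return max(alleles) < len(ad.split(","))


def blank_bad_ad(gt, ad):
    """Mimic the effect described above: set an offending AD to missing."""
    return ad if ad_covers_gt(gt, ad) else "."
```

For the case in this thread, blank_bad_ad("0/2", "10,5") gives "." because allele index 2 has no depth entry, while a genotype such as "0/1" with the same AD is left alone.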

This isn’t the first time we’ve seen problematic output from vcftools…

Thanks so much for the quick replies! I’ll test out those suggestions.

And you are right that AD is supposed to be length R (nAllele). The output from our GATK pipeline sets it to ‘.’, so I think vcf-merge is just preserving the ambiguity.

Anyway thank you again!