Hail V01 to V02 missing genotypes?

knguyen142 · April 18, 2019, 2:45pm

Hi, I was validating some Hail v02 code against v01 and am running into an issue where Hail v01 says some genotypes are missing while v02 says they are present. Is this documented anywhere?
For example, on the 1kg30variants VCF:

vds = hc.import_vcf(input_path, force_bgz=True)
vds_filtered = vds.filter_variants_list([Variant.parse('1:881918:G:A')])  # rsid rs35471880
vds_filtered.export_genotypes('vds-genotypes.csv', 's', export_missing=True)

# shows NA19675, NA19679, NA20870, NA20872, NA20876

while for v02:

mt = hl.import_vcf(input_path)
mt = mt.filter_rows(mt.rsid == 'rs35471880')
mt = mt.filter_entries(hl.is_missing(mt.GT))

# is empty

Am I missing something?

tpoterba · April 18, 2019, 2:48pm

I believe this is related to https://github.com/hail-is/hail/issues/3472

In 0.1 we enforced certain assumptions about genotypes in VCFs, like that DP >= sum(AD) and that the PL corresponding to the called genotype is 0. If any of these assumptions were violated, we set the entire genotype to missing.

In 0.2 we don’t make any assumptions, importing the data as-is.
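As a rough illustration of the two checks named above, here is a minimal plain-Python sketch. It is not Hail's code and does not use the Hail API: the function names and the dict-based entry are hypothetical, only the two assumptions mentioned in this thread are covered (0.1 made others), and skipping checks when a field is absent is an assumption of the sketch. The PL index uses the standard VCF ordering for unphased diploid genotypes.

```python
from typing import Optional, Sequence, Tuple


def called_pl_index(gt: Tuple[int, int]) -> int:
    """Index of an unphased diploid genotype in the PL array (VCF ordering)."""
    j, k = sorted(gt)
    return k * (k + 1) // 2 + j


def passes_v01_style_checks(
    gt: Optional[Tuple[int, int]],
    ad: Optional[Sequence[int]],
    dp: Optional[int],
    pl: Optional[Sequence[int]],
) -> bool:
    """Return False if either of the two checks named in this thread fails.

    Fields that are None are skipped rather than treated as failures;
    that choice is an assumption of this sketch.
    """
    # Check 1: total depth must be at least the sum of per-allele depths.
    if dp is not None and ad is not None and dp < sum(ad):
        return False
    # Check 2: the PL of the called genotype must be 0.
    if gt is not None and pl is not None and pl[called_pl_index(gt)] != 0:
        return False
    return True


def apply_v01_style_filter(entry: dict) -> dict:
    """Blank out every field of the entry if a check fails, else return it unchanged."""
    ok = passes_v01_style_checks(
        entry.get("GT"), entry.get("AD"), entry.get("DP"), entry.get("PL")
    )
    return entry if ok else {field: None for field in entry}
```

For instance, an entry with GT 0/1, AD [10, 5], DP 12 fails the first check (12 < 15), so the whole entry would be set to missing, which matches the 0.1 behaviour described above.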

knguyen142 · April 18, 2019, 2:55pm

Awesome, thanks for the quick response as always, Tim. It makes sense that Hail doesn’t automatically do any validity checks. Good to know it wasn’t us doing something dumb haha.

mwilson · May 9, 2019, 3:01pm

Hi Tim, sorry to resurrect this again… for the third time… but we want to explore using the Hail 0.1 assumptions in our 0.2 pipeline. Would it be possible to get a list of the other assumptions 0.1 made, in addition to DP >= sum(AD) and PL = 0 for the called genotype? I’ve reached out to DSP through the GATK forum about why they suggested those two specific assumptions and heard back. I appreciate your patience with this!

tpoterba · May 9, 2019, 3:47pm

here: https://github.com/hail-is/hail/blob/0.1/src/main/scala/is/hail/io/vcf/LoadVCF.scala#L42

mwilson · May 9, 2019, 4:38pm

Thanks Tim!
