Hi, I was validating some hail v02 code against v01 and am running into an issue where hail v01 says some genotypes are missing while v02 says they are present. Is this documented anywhere?
e.g. on the 1kg30variants vcf:
vds = hc.import_vcf(input_path, force_bgz=True)
vds_filtered = vds.filter_variants_list([Variant.parse('1:881918:G:A')])  # rsid rs35471880
vds_filtered.export_genotypes('vds-genotypes.csv', 's', export_missing=True)
# shows NA19675, NA19679, NA20870, NA20872, NA20876
while for v02:
mt = hl.import_vcf(input_path)
mt = mt.filter_rows(mt.rsid == 'rs35471880')
mt.filter_entries(hl.is_missing(mt.GT))
# is empty
Am I missing something?
I believe this is related to https://github.com/hail-is/hail/issues/3472
In 0.1 we enforced certain assumptions about genotypes in VCFs, like that DP >= sum(AD) and that the PL corresponding to the called genotype is 0. If any of these assumptions were violated, we set the entire genotype to missing.
In 0.2 we don’t make any assumptions, importing the data as-is.
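For anyone who wants to see what those two checks amount to, here is a minimal plain-Python sketch. It only covers the two assumptions named above (DP >= sum(AD), and PL of the called genotype is 0); it is not the actual 0.1 import code, and the function names are made up for illustration. It assumes unphased diploid calls and the standard VCF genotype ordering for PL, and it treats a missing field as "nothing to check", which may or may not match what 0.1 did.

```python
from typing import Optional, Sequence, Tuple


def gt_index(a: int, b: int) -> int:
    """Index of an unphased diploid genotype in PL, using VCF ordering."""
    j, k = sorted((a, b))
    return k * (k + 1) // 2 + j


def passes_v01_style_checks(
    gt: Optional[Tuple[int, int]],
    ad: Optional[Sequence[int]],
    dp: Optional[int],
    pl: Optional[Sequence[int]],
) -> bool:
    """Return False if either of the two assumptions is violated.

    Hypothetical helper: a missing field (None) skips the check that needs it.
    """
    # Assumption 1: total depth is at least the sum of the allelic depths.
    if dp is not None and ad is not None and dp < sum(ad):
        return False
    # Assumption 2: the PL of the called genotype is 0.
    if gt is not None and pl is not None and pl[gt_index(*gt)] != 0:
        return False
    return True


# A het call whose DP is smaller than sum(AD) fails the first check.
print(passes_v01_style_checks((0, 1), [10, 5], 12, [30, 0, 40]))
# A het call whose own PL is not 0 fails the second check.
print(passes_v01_style_checks((0, 1), [10, 5], 15, [0, 30, 40]))
```

In a 0.2 pipeline the same predicate would have to be written with Hail expressions and passed to something like filter_entries, so that failing genotypes are dropped the way 0.1 set them to missing.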
Awesome, thanks for the quick response as always, Tim. It makes sense that hail doesn’t automatically do any validity checks. Good to know it wasn’t us doing something dumb haha.
Hi Tim, sorry to resurrect this again… for the third time… but we want to explore using the hail 0.1 assumptions in our 0.2 pipeline. Would it be possible to get a list of the other assumptions 0.1 made in addition to DP >= sum(AD) and that the PL = 0 for the called genotype? I’ve reached out to DSP through the GATK forum about why they suggested those two specific assumptions and heard back. I appreciate your patience with this!