When loading an aggregate VCF built with Illumina agg software from Strelka-called Illumina gVCFs, via:
var_mt = hl.import_vcf(in_files, force_bgz=True)
we get the following error:
HailException: invalid genotype signature: expected signatures to be identical for all inputs.
s3://regap-183760095058-eu-west-2-data/private-preview/subset_10_22/20k_GRCH38_germlinechr20_13754086_18515298.vcf.bgz: struct{GT: call, DP: int32, DPF: int32, AD: array, GQ: int32, PF: array}
s3://regap-183760095058-eu-west-2-data/private-preview/subset_10_22/20k_GRCH38_germlinechr20_61920330_64334162.vcf.bgz: struct{GT: call, DP: int32, DPF: int32, AD: array, GQ: int32, PF: array, PL: array}
Is there a restriction on one or more attributes, or on a combination of attributes, that we need to follow?
Thanks
Mark
These are both valid VCFs to Hail if you import them individually, but Hail rejects importing multiple VCFs in the same import_vcf invocation if their schemas differ. That's what's going on here: one of the VCFs has a PL field and the other doesn't.
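To see which files disagree before calling import_vcf, one option is to compare the FORMAT fields declared in each header. The sketch below is only an illustration: the helper names (format_fields, schema_mismatches, import_with_shared_entries) are made up for this example, it assumes the files are readable from a local path rather than S3, and it assumes the genotype signature follows the header's FORMAT lines. The last function is an untested workaround idea that drops fields not present in every file (so PL would be lost here); it also assumes the row (INFO) schemas and samples already match, which union_rows requires.

```python
import gzip
import re

FORMAT_ID = re.compile(r"^##FORMAT=<ID=([^,>]+)")


def format_fields(path):
    """Return the FORMAT IDs declared in a VCF header, in header order."""
    fields = []
    # bgzip output is gzip-compatible, so gzip.open can read .gz/.bgz files.
    opener = gzip.open if path.endswith((".gz", ".bgz")) else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # reached the #CHROM line, header metadata is over
            match = FORMAT_ID.match(line)
            if match:
                fields.append(match.group(1))
    return fields


def schema_mismatches(paths):
    """Map each path to the FORMAT IDs it lacks relative to all files combined."""
    per_file = {p: format_fields(p) for p in paths}
    union = set().union(*per_file.values())
    return {
        p: sorted(union - set(found))
        for p, found in per_file.items()
        if union - set(found)
    }


def import_with_shared_entries(paths):
    """Hypothetical workaround: import files one by one, keep shared entry fields, union rows."""
    import hail as hl  # imported lazily so the header check works without Hail

    mts = [hl.import_vcf(p, force_bgz=True) for p in paths]
    # Keep only entry fields that every file has (assumes their types also agree).
    shared = [
        name
        for name in mts[0].entry.dtype.fields
        if all(name in mt.entry.dtype.fields for mt in mts[1:])
    ]
    mts = [mt.select_entries(*shared) for mt in mts]
    return mts[0].union_rows(*mts[1:]) if len(mts) > 1 else mts[0]
```

For the two files in the error above, schema_mismatches would be expected to report that the chr20_13754086_18515298 chunk lacks PL.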
I should also note that Hail doesn't handle gVCFs well right now. One of our team members is working on a gVCF import/merging algorithm and a sparse genotype matrix representation, though, which we expect to be usable and documented in a couple of months.
Excellent. Thanks for the clarification
Also note that import_vcf supports a list of files with the same samples and non-overlapping genomic intervals. It doesn't support importing multiple single-sample [g]VCFs.
Well, we're running against a 20K aggregate gVCF, but look forward to the update.