I have an 80 GB VCF file that, as I now understand, contains a lot of messy data. The FORMAT is GT:AD:DP:GQ:PGT:PID:PL, but I was told that only GT:AD matter. The import always fails because of PGT (the 5th field). Sometimes that field was empty, which I worked around with find_replace=('::',':.:') (neat!), but my last attempt threw this:
is.hail.utils.HailException: merged2.vcf.gz:column 1011: invalid character ',' in integer literal
... . ./.:0,0:0:0:.:.:. 0/0:35,0:35:63:0,63,945:.:. ./.:0,0:0:0:.:.:. 0/0:23 ...
^
I have this FORMAT definition:
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
I’ve checked the VCF documentation, just in case.
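To see which value lands in which FORMAT slot, here is a minimal plain-Python sketch (no Hail involved) that labels the entry from the error message with the FORMAT keys. The helper name label_fields is something I made up for illustration.

```python
# FORMAT keys as declared for this file.
FORMAT_KEYS = "GT:AD:DP:GQ:PGT:PID:PL".split(":")


def label_fields(entry, keys=FORMAT_KEYS):
    """Pair each colon-separated value of one sample entry with its FORMAT key.

    label_fields is a hypothetical helper, not part of Hail.
    """
    values = entry.split(":")
    return dict(zip(keys, values))


# The entry copied from the error message above.
entry = "0/0:35,0:35:63:0,63,945:.:."
labeled = label_fields(entry)

for key, value in labeled.items():
    print(f"{key:>3} = {value}")
```

For that entry, the PGT slot holds 0,63,945, which looks more like a PL triple than a phasing string, while the PL slot holds a lone dot.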
I’m doing:
hl.import_vcf('gs://....vcf.gz', min_partitions=200, force_bgz=True, array_elements_required=False, find_replace=('::',':.:')).write("gs://...out..mt", overwrite=True)
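As far as I understand from the Hail docs, find_replace acts like a regex substitution (similar to re.sub), so I would expect matches not to overlap. Below is a small plain-Python sketch of what my pattern does when two empty fields sit next to each other. The lookahead variant is only an idea and I have not tried it in Hail; the sample entry is made up.

```python
import re

# Made-up entry with two adjacent empty fields (three colons in a row).
entry = "0/0:35,0:35:63:::0,63,945"

# My current pattern: each match consumes both colons, so the second
# empty field is left untouched.
simple = re.sub("::", ":.:", entry)

# Untested idea: match a colon only when another colon follows it,
# without consuming the second colon (lookahead).
lookahead = re.sub(":(?=:)", ":.", entry)

print(simple)
print(lookahead)
```

In this sketch the first result still contains "::", while the lookahead version fills both empty fields.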
Is there any way to circumvent this issue or, better, to import only GT:AD?
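One fallback I am considering is trimming every data line down to GT:AD before the import. The following is only a rough plain-Python sketch of that idea, not a Hail feature: keep_format_fields is a hypothetical helper, it assumes tab-separated VCF data lines, and the site columns in the example line are made up.

```python
def keep_format_fields(line, keep=("GT", "AD")):
    """Return a VCF line with only the `keep` FORMAT fields in each sample.

    Hypothetical helper; header lines and lines without sample columns
    are returned unchanged.
    """
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 10:  # no FORMAT/sample columns on this line
        return line
    keys = cols[8].split(":")
    # Positions of the wanted keys in this line's FORMAT column.
    idx = [keys.index(k) for k in keep if k in keys]
    cols[8] = ":".join(keys[i] for i in idx)
    for c in range(9, len(cols)):
        vals = cols[c].split(":")
        # Use "." when a value is absent or empty.
        cols[c] = ":".join(
            vals[i] if i < len(vals) and vals[i] != "" else "." for i in idx
        )
    return "\t".join(cols) + newline


# Made-up site columns; the two sample entries come from the error message.
example = (
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP:GQ:PGT:PID:PL"
    "\t./.:0,0:0:0:.:.:.\t0/0:35,0:35:63:0,63,945:.:.\n"
)
trimmed = keep_format_fields(example)
print(trimmed, end="")
```

Since GT and AD are the first two fields, this would sidestep whatever is wrong in the later ones, but running it over 80 GB is obviously slow, so a built-in option would be much nicer.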
Many thanks in advance, Alan