Import_vcf from an ugly file, FORMAT issues

So, I have a 80 GB VCF file that I understand now has lots of messy data. The FORMAT is GT:AD:DP:GQ:PGT:PID:PL but I was told that only GT:AD matter. When importing, it is always failing thanks to PGT (5th field), sometimes it was empty (then I used find_replace=('::',':.:'), neaty!), but my last try threw that:

is.hail.utils.HailException: merged2.vcf.gz:column 1011: invalid character ',' in integer literal
... . ./.:0,0:0:0:.:.:. 0/0:35,0:35:63:0,63,945:.:. ./.:0,0:0:0:.:.:. 0/0:23 ...
                                        ^

I have this FORMAT definition:

##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">

I’ve checked VCF doc, just in case.
I’m doing:

hl.import_vcf('gs://....vcf.gz',min_partitions=200,force_bgz=True,array_elements_required = False,find_replace=('::',':.:')).write("gs://...out..mt", overwrite=True)

Anyway to circumvent this issue, or better, importing only GT:AD ?
Many thanks in advance, Alan

Do all of the PGT fields have 3 comma separated values? Or does the number of values vary?

If you set PGT's format to Number=. thus indicating its a variable length array, does it read correctly?

@johnc1231 Sometimes it’s empty (’::’), so I use find_replace=('::',':.:') to address these cases.
@danking, do you mean change the vcf.gz file? Sorry, I understand and can try what you mean, just don’t know where to apply it.

Yeah, I mean modify the header of the VCF. It looks like your PGT is an array of integer, you should be able to parse that correctly with this header line:

##FORMAT=<ID=PGT,Number=.,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">

Though, I must admit I don’t know how to interpret [0, 63, 945] as a phased genotype.

And I don’t need it either, so told me my collaborator, who created these file 2 years ago.
I’m reading the import_vcf docs hoping to understand what can I do with parameter call_fields. Could it work? Examples?

call_fields let’s you parse more fields as calls, but I don’t think it will let you skip a bad field. It still needs to parse as something.

I think I’d try to get find_replace working with regexps. Probably you can match your genotype records with:

([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*)

Then you can just select out the ones you want:

\0:\1

The patterns should follow https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html.

Thanks @danking. I’ve tweaked the regexp, you can see it here https://regex101.com/r/Q2JT1D/1

find_replace=(r"""\t([^: ]{1,5}):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: \t]*)""", r'\1:\2')

But still failing to do the proper replacement.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 20 times, most recent failure: Lost task 3.19 in stage 1.0 (TID 52, vep-hail3-w-5.europe-west2-b.c.alan-hail-test.internal, executor 4): is.hail.utils.HailException: merged2.vcf.gz:column 7073: unexpected end of line
... 21:21:21:21:21:21:21:21:21:2 ./.:.:.
                                        ^
offending line: 1	3566631	rs3754322721:21:21:21:21:21:21:21:21:21:21:21:21:2...

I understand how powerful it can be, I may just need to get the right regexp. I’ll keep trying.

I created an small VCF, with only 10 rows from my big VCF file. Then I did modified the PGT format, as suggested, using a ‘.’ (a dot). But running, I got this error:

FatalError: HailException: fields in 'call_fields' must have 'Number' equal to 1.

Damn. Foiled by the spec again. I’m not sure off the top of my head what format type to use. If there’s a “string” format type you might try that.

Ah, java regex is weird, you have to use $n for replacement. The following will pick out the first two groups and set the other three to NA. You’ll want array_elements_required=False. I think you actually have more than five format fields, so you may need to add more groups.

find_replace=('([0-9/,.]+):([0-9/,.]+):([0-9/,.]+):([0-9/,.]+):([0-9/,.]+)',
              '$1:$2:.:.:.')

I’ve tried ‘.’, A, R and G (as I read VCFv4.2.pdf), but now show, same error: fields in 'call_fields' must have 'Number' equal to 1.

I thought I knew one or two things about regexp… Java is a new beast for me. Anyway, I’m testing right now with that:

... array_elements_required = False, find_replace=(r'(::|:\d+,\d+,\d+:)',r':.:')

It worked on small test set, now trying the big 80Gb one.

It did work to write my mt fite! Now it’s running VEP and this will take time…

Justo to understand something else: I’ve set min_partitions=200 but I got 617. How is Hail deciding that? And how many cores would you recommend to process this file with VEP?

And many thanks guys, hail Hail! I’m quite happy with what I’ve learnt so far.
PS: please all these questions can wait :wink:

Normally it splits your file into 128MB partitions. That’s ~80GB / 128MB = ~625. min_partitions just sets the minimum in case you want more/smaller partitions that the default.