Import_vcf from an ugly file, FORMAT issues

So, I have a 80 GB VCF file that I understand now has lots of messy data. The FORMAT is GT:AD:DP:GQ:PGT:PID:PL but I was told that only GT:AD matter. When importing, it is always failing thanks to PGT (5th field), sometimes it was empty (then I used find_replace=('::',':.:'), neaty!), but my last try threw that:

is.hail.utils.HailException: merged2.vcf.gz:column 1011: invalid character ',' in integer literal
... . ./.:0,0:0:0:.:.:. 0/0:35,0:35:63:0,63,945:.:. ./.:0,0:0:0:.:.:. 0/0:23 ...
                                        ^

I have this FORMAT definition:

##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">

I’ve checked VCF doc, just in case.
I’m doing:

hl.import_vcf('gs://....vcf.gz',min_partitions=200,force_bgz=True,array_elements_required = False,find_replace=('::',':.:')).write("gs://...out..mt", overwrite=True)

Anyway to circumvent this issue, or better, importing only GT:AD ?
Many thanks in advance, Alan

Do all of the PGT fields have 3 comma separated values? Or does the number of values vary?

If you set PGT's format to Number=. thus indicating its a variable length array, does it read correctly?

@johnc1231 Sometimes it’s empty (’::’), so I use find_replace=('::',':.:') to address these cases.
@danking, do you mean change the vcf.gz file? Sorry, I understand and can try what you mean, just don’t know where to apply it.

Yeah, I mean modify the header of the VCF. It looks like your PGT is an array of integer, you should be able to parse that correctly with this header line:

##FORMAT=<ID=PGT,Number=.,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">

Though, I must admit I don’t know how to interpret [0, 63, 945] as a phased genotype.

And I don’t need it either, so told me my collaborator, who created these file 2 years ago.
I’m reading the import_vcf docs hoping to understand what can I do with parameter call_fields. Could it work? Examples?

call_fields let’s you parse more fields as calls, but I don’t think it will let you skip a bad field. It still needs to parse as something.

I think I’d try to get find_replace working with regexps. Probably you can match your genotype records with:

([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*)

Then you can just select out the ones you want:

\0:\1

The patterns should follow https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html.

Thanks @danking. I’ve tweaked the regexp, you can see it here https://regex101.com/r/Q2JT1D/1

find_replace=(r"""\t([^: ]{1,5}):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: \t]*)""", r'\1:\2')

But still failing to do the proper replacement.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 20 times, most recent failure: Lost task 3.19 in stage 1.0 (TID 52, vep-hail3-w-5.europe-west2-b.c.alan-hail-test.internal, executor 4): is.hail.utils.HailException: merged2.vcf.gz:column 7073: unexpected end of line
... 21:21:21:21:21:21:21:21:21:2 ./.:.:.
                                        ^
offending line: 1	3566631	rs3754322721:21:21:21:21:21:21:21:21:21:21:21:21:2...

I understand how powerful it can be, I may just need to get the right regexp. I’ll keep trying.

I created an small VCF, with only 10 rows from my big VCF file. Then I did modified the PGT format, as suggested, using a ‘.’ (a dot). But running, I got this error:

FatalError: HailException: fields in 'call_fields' must have 'Number' equal to 1.

Damn. Foiled by the spec again. I’m not sure off the top of my head what format type to use. If there’s a “string” format type you might try that.

Ah, java regex is weird, you have to use $n for replacement. The following will pick out the first two groups and set the other three to NA. You’ll want array_elements_required=False. I think you actually have more than five format fields, so you may need to add more groups.

find_replace=('([0-9/,.]+):([0-9/,.]+):([0-9/,.]+):([0-9/,.]+):([0-9/,.]+)',
              '$1:$2:.:.:.')

I’ve tried ‘.’, A, R and G (as I read VCFv4.2.pdf), but now show, same error: fields in 'call_fields' must have 'Number' equal to 1.

I thought I knew one or two things about regexp… Java is a new beast for me. Anyway, I’m testing right now with that:

... array_elements_required = False, find_replace=(r'(::|:\d+,\d+,\d+:)',r':.:')

It worked on small test set, now trying the big 80Gb one.

It did work to write my mt fite! Now it’s running VEP and this will take time…

Justo to understand something else: I’ve set min_partitions=200 but I got 617. How is Hail deciding that? And how many cores would you recommend to process this file with VEP?

And many thanks guys, hail Hail! I’m quite happy with what I’ve learnt so far.
PS: please all these questions can wait :wink:

Normally it splits your file into 128MB partitions. That’s ~80GB / 128MB = ~625. min_partitions just sets the minimum in case you want more/smaller partitions that the default.

Sorry to push this ‘dead’ thread, but I’m facing a similar issue where hail complains about PGT (on hail 0.2.11). I’m trying to get MNV from my vcf using the scripts provided at GitHub - macarthur-lab/gnomad_mnv: Multi-nucleotide variants (MNVs) in gnomAD 2.1. The regex during import didn’t help me out. Also, my INFO in vcf is as follows - GT:AD:DP:GQ:PL but my header does describe PGT, just my info doesn’t have it.

Taking a peek at get_mnv.py (The first of many scripts) I changed the import vcf command from vcf = hl.import_vcf(sys.argv[1], call_fields=["GT"]) to vcf = hl.import_vcf(sys.argv[1]) and it accepted my file, though I’m not sure if it screwed up with the analysis (the mt table thankfully wasn’t empty, but I’m still skeptical). If my issue is separate, do let me know and I would make another thread.

This may be the problem. That’s super old.

That’s what the tutorial recommends.

Ah, I see. Hail scripts are back-compatible within the life of 0.2, so I think you should try with the latest release. There’s been a lot of development in the last 2 years! We also don’t really have the resources to debug old versions, since quite often we end up rediscovering bugs that have since been fixed.

Hmm, makes sense. I’ll try running it on the latest version and check it out.