Import_vcf from an ugly file, FORMAT issues

alanwilter · May 8, 2020, 7:21pm

So, I have a 80 GB VCF file that I understand now has lots of messy data. The FORMAT is GT:AD:DP:GQ:PGT:PID:PL but I was told that only GT:AD matter. When importing, it is always failing thanks to PGT (5th field), sometimes it was empty (then I used find_replace=('::',':.:'), neaty!), but my last try threw that:

is.hail.utils.HailException: merged2.vcf.gz:column 1011: invalid character ',' in integer literal
... . ./.:0,0:0:0:.:.:. 0/0:35,0:35:63:0,63,945:.:. ./.:0,0:0:0:.:.:. 0/0:23 ...
                                        ^

I have this FORMAT definition:

##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">

I’ve checked VCF doc, just in case.
I’m doing:

hl.import_vcf('gs://....vcf.gz',min_partitions=200,force_bgz=True,array_elements_required = False,find_replace=('::',':.:')).write("gs://...out..mt", overwrite=True)

Anyway to circumvent this issue, or better, importing only GT:AD ?
Many thanks in advance, Alan

johnc1231 · May 8, 2020, 7:25pm

Do all of the PGT fields have 3 comma separated values? Or does the number of values vary?

danking · May 8, 2020, 7:29pm

If you set PGT's format to Number=. thus indicating its a variable length array, does it read correctly?

alanwilter · May 8, 2020, 8:18pm

@johnc1231 Sometimes it’s empty (’::’), so I use find_replace=('::',':.:') to address these cases.
@danking, do you mean change the vcf.gz file? Sorry, I understand and can try what you mean, just don’t know where to apply it.

danking · May 8, 2020, 8:21pm

Yeah, I mean modify the header of the VCF. It looks like your PGT is an array of integer, you should be able to parse that correctly with this header line:

##FORMAT=<ID=PGT,Number=.,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">

Though, I must admit I don’t know how to interpret [0, 63, 945] as a phased genotype.

alanwilter · May 8, 2020, 8:25pm

And I don’t need it either, so told me my collaborator, who created these file 2 years ago.
I’m reading the import_vcf docs hoping to understand what can I do with parameter call_fields. Could it work? Examples?

danking · May 8, 2020, 8:32pm

call_fields let’s you parse more fields as calls, but I don’t think it will let you skip a bad field. It still needs to parse as something.

I think I’d try to get find_replace working with regexps. Probably you can match your genotype records with:

([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*)

Then you can just select out the ones you want:

\0:\1

The patterns should follow https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html.

alanwilter · May 9, 2020, 7:40am

Thanks @danking. I’ve tweaked the regexp, you can see it here https://regex101.com/r/Q2JT1D/1

find_replace=(r"""\t([^: ]{1,5}):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: ]*):([^: \t]*)""", r'\1:\2')

But still failing to do the proper replacement.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 20 times, most recent failure: Lost task 3.19 in stage 1.0 (TID 52, vep-hail3-w-5.europe-west2-b.c.alan-hail-test.internal, executor 4): is.hail.utils.HailException: merged2.vcf.gz:column 7073: unexpected end of line
... 21:21:21:21:21:21:21:21:21:2 ./.:.:.
                                        ^
offending line: 1	3566631	rs3754322721:21:21:21:21:21:21:21:21:21:21:21:21:2...

I understand how powerful it can be, I may just need to get the right regexp. I’ll keep trying.

alanwilter · May 9, 2020, 11:27am

I created an small VCF, with only 10 rows from my big VCF file. Then I did modified the PGT format, as suggested, using a ‘.’ (a dot). But running, I got this error:

FatalError: HailException: fields in 'call_fields' must have 'Number' equal to 1.

danking · May 9, 2020, 3:23pm

Damn. Foiled by the spec again. I’m not sure off the top of my head what format type to use. If there’s a “string” format type you might try that.

danking · May 9, 2020, 4:45pm

Ah, java regex is weird, you have to use $n for replacement. The following will pick out the first two groups and set the other three to NA. You’ll want array_elements_required=False. I think you actually have more than five format fields, so you may need to add more groups.

find_replace=('([0-9/,.]+):([0-9/,.]+):([0-9/,.]+):([0-9/,.]+):([0-9/,.]+)',
              '$1:$2:.:.:.')

alanwilter · May 9, 2020, 6:11pm

I’ve tried ‘.’, A, R and G (as I read VCFv4.2.pdf), but now show, same error: fields in 'call_fields' must have 'Number' equal to 1.

alanwilter · May 9, 2020, 6:14pm

I thought I knew one or two things about regexp… Java is a new beast for me. Anyway, I’m testing right now with that:

... array_elements_required = False, find_replace=(r'(::|:\d+,\d+,\d+:)',r':.:')

It worked on small test set, now trying the big 80Gb one.

alanwilter · May 9, 2020, 8:47pm

It did work to write my mt fite! Now it’s running VEP and this will take time…

Justo to understand something else: I’ve set min_partitions=200 but I got 617. How is Hail deciding that? And how many cores would you recommend to process this file with VEP?

And many thanks guys, hail Hail! I’m quite happy with what I’ve learnt so far.
PS: please all these questions can wait

cseed · May 9, 2020, 9:29pm

Normally it splits your file into 128MB partitions. That’s ~80GB / 128MB = ~625. min_partitions just sets the minimum in case you want more/smaller partitions that the default.

kvn95ss · February 18, 2021, 4:39am

Sorry to push this ‘dead’ thread, but I’m facing a similar issue where hail complains about PGT (on hail 0.2.11). I’m trying to get MNV from my vcf using the scripts provided at GitHub - macarthur-lab/gnomad_mnv: Multi-nucleotide variants (MNVs) in gnomAD 2.1. The regex during import didn’t help me out. Also, my INFO in vcf is as follows - GT:AD:DP:GQ:PL but my header does describe PGT, just my info doesn’t have it.

Taking a peek at get_mnv.py (The first of many scripts) I changed the import vcf command from vcf = hl.import_vcf(sys.argv[1], call_fields=["GT"]) to vcf = hl.import_vcf(sys.argv[1]) and it accepted my file, though I’m not sure if it screwed up with the analysis (the mt table thankfully wasn’t empty, but I’m still skeptical). If my issue is separate, do let me know and I would make another thread.

tpoterba · February 18, 2021, 11:41am

This may be the problem. That’s super old.

kvn95ss · February 22, 2021, 10:06am

That’s what the tutorial recommends.

tpoterba · February 22, 2021, 1:41pm

Ah, I see. Hail scripts are back-compatible within the life of 0.2, so I think you should try with the latest release. There’s been a lot of development in the last 2 years! We also don’t really have the resources to debug old versions, since quite often we end up rediscovering bugs that have since been fixed.

kvn95ss · March 3, 2021, 5:25am

Hmm, makes sense. I’ll try running it on the latest version and check it out.

Topic		Replies	Views
Import_vcf report error: fields in 'call_fields' must have 'Number' equal to 1 Hail Query & hailctl	4	391	October 7, 2022
Importing VCFs where GT is 0/2 but AD has only 2 entries Help [0.1]	3	1252	June 9, 2017
Joint genotype example vcf file Hail Query & hailctl	10	1200	May 6, 2020
Import_vcf error: empty integer field Help [0.1]	8	895	September 6, 2018
Entry fields empty after import_vcf from 1000 genomes data Hail Query & hailctl	9	807	August 21, 2018

Import_vcf from an ugly file, FORMAT issues

Related topics