Matrix table error

JanErik · April 26, 2018, 3:18pm

I’m trying to build a reference table using a different reference genome. Variants were called with bcftools mpileup, bcftools call, and bcftools filter. I didn’t receive any errors when reading the reference from a FASTA file (this is for the plant species Sorghum bicolor), nor when importing a VCF file. When I try the count method, however, I get the error below (the full line from hail.log is at the bottom).
Cheers,
Jan Erik

IN: first30.count()
OUT: FatalError: HailException: first_30_merged.vcf:column 180: invalid character ‘.’ in integer literal
… 19;AN=54;AC=53,1 GT:PL 1/1:130,36,0,.,.,. 1/1:100,18,0,.,.,. 1/1:143,27, …
^
offending line: Chr01 74 . G C,A 127 LOWQUAL VDB=0.00595195;SGB=-0.680642;MQ…
see the Hail log for the full offending line

Steps leading up to error:

IN: sb2 = hl.ReferenceGenome.from_fasta_file(name='sb2', fasta_file='/d1/sorghum/ref/v2/Sbicolor_255_v2.0.fa', index_file='/d1/sorghum/ref/v2/Sbicolor_255_v2.0.fa.fai')

IN: sb2
OUT: ReferenceGenome(name=sb2, contigs=[‘Chr01’, ‘Chr02’, ‘Chr03’, ‘Chr04’, ‘Chr05’, ‘Chr06’, ‘Chr07’, ‘Chr08’, ‘Chr09’, ‘Chr10’, ‘super_10’, ‘super_11’, ‘super_12’, ‘super_13’, ‘super_14’, ‘super_15’, ‘super_16’, ‘super_17’, ‘super_18’, ‘super_19’, ‘super_20’, ‘super_21’, ‘super_22’, ‘super_23’, ‘super_24’, ‘super_25’, ‘super_26’, ‘super_27’, ‘super_28’, ‘super_29’, ‘super_30’, … (many extra contigs)

IN: first30 = hl.import_vcf('/d1/sorghum/vcfs/first_30_merged.vcf', reference_genome='sb2')
OUT: 2018-04-26 10:48:11 Hail: INFO: Ordering unsorted dataset with network shuffle

IN: first30.describe()
OUT: ----------------------------------------
Global fields:
None

Column fields:
's': str

Row fields:
'locus': locus<sb2>
'alleles': array<str>
'rsid': str
'qual': float64
'filters': set<str>
'info': struct {
INDEL: bool,
IDV: int32,
IMF: float64,
DP: int32,
…

From log file:

at java.lang.Thread.run(Thread.java:748)
is.hail.utils.HailException: first_30_merged.vcf:column 180: invalid character ‘.’ in integer literal
… 19;AN=54;AC=53,1 GT:PL 1/1:130,36,0,.,.,. 1/1:100,18,0,.,.,. 1/1:143,27, …
^
offending line: Chr01 74 . G C,A 127 LOWQUAL VDB=0.00595195;SGB=-0.680642;MQSB=1;MQ0F=0.0769231;MQ=17;RPB=0.961538;MQB=0.730769;BQB=0.730769;DP=309;DP4=9,1,214,19;AN=54;AC=53,1 GT:PL 1/1:130,36,0,.,.,. 1/1:100,18,0,.,.,. 1/1:143,27,0,.,.,. 1/1:140,27,0,.,.,. 1/1:131,15,0,.,.,. 1/1:125,18,0,.,.,. 1/1:109,12,0,.,.,. 1/1:72,18,0,.,.,. 1/1:91,8,0,.,.,. 1/1:71,29,20,.,.,. 1/2:158,58,40,114,0,108 1/1:119,9,0,.,.,. 1/1:101,32,17,.,.,. 1/1:101,20,2,.,.,. 1/1:156,21,2,.,.,. 1/1:78,16,3,.,.,. 1/1:124,15,0,.,.,. 1/1:64,18,0,.,.,. 1/1:131,24,0,.,.,. 1/1:144,35,24,.,.,. 1/1:109,33,3,.,.,. 1/1:129,32,5,.,.,. 1/1:40,12,0,.,.,. 1/1:138,21,0,.,.,. 1/1:101,15,0,.,.,. ./.:. 1/1:149,45,0,.,.,. 1/1:127,21,0,.,.,.

tpoterba · April 26, 2018, 4:19pm

This is awesome! First time Hail has been used for non-human analysis, I think. The reference genome looks fine, but there’s another problem…

Your VCF has some weird patterns that don’t fall under the spec, which we don’t currently support. The VCF spec indicates that the individual elements of array format fields (like PL here) cannot be missing, but your data has PL fields like 143,27,0,.,.,.. Would it be possible for you to fix these up with another tool before importing? I think the two best options are:

- make the entire field missing when this happens
- fill in the missing entries with a “zero value” like 999
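The first option above can be sketched in plain Python, outside of Hail. This is a minimal illustration, not a tool from the thread: the function name `blank_partial_fields` is hypothetical, and it assumes an uncompressed, tab-delimited VCF processed one line at a time.

```python
def blank_partial_fields(line, names=("PL",)):
    """Replace a per-sample FORMAT array (e.g. PL) with '.' when any of
    its comma-separated entries is missing. Hypothetical helper."""
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 10:
        return line  # no FORMAT/sample columns to fix
    keys = cols[8].split(":")
    targets = [i for i, key in enumerate(keys) if key in names]
    for c in range(9, len(cols)):
        values = cols[c].split(":")
        for i in targets:
            if i < len(values):
                entries = values[i].split(",")
                # e.g. "130,36,0,.,.,." has some missing entries
                if len(entries) > 1 and "." in entries:
                    values[i] = "."
        cols[c] = ":".join(values)
    return "\t".join(cols) + newline


# Shortened version of the offending record from the error message.
record = ("Chr01\t74\t.\tG\tC,A\t127\tLOWQUAL\tDP=309\tGT:PL\t"
          "1/1:130,36,0,.,.,.\t1/2:158,58,40,114,0,108\t./.:.\n")
fixed = blank_partial_fields(record)
print(fixed, end="")
```

Fully populated PL values and already-missing fields are left untouched; only the partially missing arrays are collapsed to a single `.`.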

JanErik · April 26, 2018, 5:36pm

Tim: Thanks for your interest and suggestions. It was indeed awesome that I was able to bring the sorghum reference genome into Hail with little effort.

I used bcftools to call the variants because I’m trying to replicate some previous results. For the purpose of working with hail, however, I’m going to take a little detour and call the same samples using GATK and try importing the GATK generated VCF file. That way I can make sure that hail will work for me before I start messing with the vcf files.

tpoterba · April 27, 2018, 2:03am

That will probably be easiest, though probably annoying to re-call the variants.

Another thing – is Sorghum diploid, as it looks from the VCF line in the error message? Hail doesn’t currently support ploidy > 2, though we’ll add this at some point.

JanErik · April 29, 2018, 9:57pm

Good progress: I generated a new VCF using GATK’s HaplotypeCaller and was able to import the resulting variants into a data matrix. There were a few multi-allelic loci which I had to eliminate before I was able to use the variant_qc method. From there I was able to complete all of the examples from the Quality Control section of the 01-genome-wide-association tutorial with my data. Nice!

Sorghum is indeed diploid.
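The post above does not say how the multi-allelic loci were eliminated. One possible approach, sketched here in plain Python as an assumption rather than a record of what was run, is to drop VCF records whose ALT column lists more than one allele before importing. The helper names `is_biallelic` and `drop_multiallelic` are hypothetical. Inside Hail, filtering rows with `hl.len(mt.alleles) == 2` or splitting with `hl.split_multi_hts` would be alternatives.

```python
def is_biallelic(line):
    """True when the ALT column (5th field) holds a single allele."""
    alt = line.rstrip("\n").split("\t")[4]
    return "," not in alt


def drop_multiallelic(lines):
    """Yield header lines and biallelic records; skip multi-allelic ones."""
    for line in lines:
        if line.startswith("#") or is_biallelic(line):
            yield line


# Toy input: one header line, one multi-allelic and one biallelic record.
example = [
    "##fileformat=VCFv4.2\n",
    "Chr01\t74\t.\tG\tC,A\t127\t.\t.\n",
    "Chr01\t90\t.\tT\tC\t50\t.\t.\n",
]
kept = list(drop_multiallelic(example))
print("".join(kept), end="")
```

Dropping records loses the variants at those sites entirely, so splitting may be preferable when the multi-allelic loci matter for the analysis.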

tpoterba · April 30, 2018, 2:09pm

Excellent! Let us know if you run into any other problems.

obigbando · August 5, 2020, 3:28am

We just hit VCFParseError: ploidy > 2 not supported when running Hail 0.2.40 on a VCF generated by GATK4 Mutect2. Is there a time frame for supporting VCFs with ploidy > 2? Filtering out those multi-allelic variants does not seem reasonable in a somatic calling scenario.

johnc1231 · August 5, 2020, 1:48pm

Your question is different from the earlier question in this thread, and it’s about Hail 0.2. Please make a new post in the Help [0.2] category.

obigbando · August 6, 2020, 1:23am

Thanks for the suggestion; just did at VCFParseError: ploidy > 2
