Matrix table error

I’m trying to build a reference table using a different reference genome. Variants were called with bcftools mpileup, bcftools call, and bcftools filter. I didn’t receive any errors when reading the reference from a FASTA file (this is for the plant species Sorghum bicolor), nor when importing a VCF file. When I try the count method, however, I get the error below (the full line from hail.log is at the bottom).
Jan Erik

IN: first30.count()
OUT: FatalError: HailException: first_30_merged.vcf:column 180: invalid character ‘.’ in integer literal
… 19;AN=54;AC=53,1 GT:PL 1/1:130,36,0,.,.,. 1/1:100,18,0,.,.,. 1/1:143,27, …
offending line: Chr01 74 . G C,A 127 LOWQUAL VDB=0.00595195;SGB=-0.680642;MQ…
see the Hail log for the full offending line

Steps leading up to error:

IN: sb2 = hl.ReferenceGenome.from_fasta_file(name='sb2', fasta_file='/d1/sorghum/ref/v2/Sbicolor_255_v2.0.fa', index_file='/d1/sorghum/ref/v2/Sbicolor_255_v2.0.fa.fai')

IN: sb2
OUT: ReferenceGenome(name=sb2, contigs=[‘Chr01’, ‘Chr02’, ‘Chr03’, ‘Chr04’, ‘Chr05’, ‘Chr06’, ‘Chr07’, ‘Chr08’, ‘Chr09’, ‘Chr10’, ‘super_10’, ‘super_11’, ‘super_12’, ‘super_13’, ‘super_14’, ‘super_15’, ‘super_16’, ‘super_17’, ‘super_18’, ‘super_19’, ‘super_20’, ‘super_21’, ‘super_22’, ‘super_23’, ‘super_24’, ‘super_25’, ‘super_26’, ‘super_27’, ‘super_28’, ‘super_29’, ‘super_30’, … (many extra contigs)

IN: first30 = hl.import_vcf('/d1/sorghum/vcfs/first_30_merged.vcf', reference_genome='sb2')
OUT: 2018-04-26 10:48:11 Hail: INFO: Ordering unsorted dataset with network shuffle

IN: first30.describe()
OUT: ----------------------------------------
Global fields:

Column fields:
‘s’: str

Row fields:
‘locus’: locus<sb2>
‘alleles’: array<str>
‘rsid’: str
‘qual’: float64
‘filters’: set<str>
‘info’: struct {
INDEL: bool,
IDV: int32,
IMF: float64,
DP: int32,

From log file:

at first_30_merged.vcf:column 180: invalid character ‘.’ in integer literal
… 19;AN=54;AC=53,1 GT:PL 1/1:130,36,0,.,.,. 1/1:100,18,0,.,.,. 1/1:143,27, …
offending line: Chr01 74 . G C,A 127 LOWQUAL VDB=0.00
.730769;DP=309;DP4=9,1,214,19;AN=54;AC=53,1 GT:PL 1/1:130,36,0,.,.,. 1/1:100,18,0,.,.,. 1/1:143,27,0,.,.,. 1/1:140,27,0,.,.,. 1/1:131,15,0,.,.,. 1/1:125,18,0,.,.,. 1/1:109,12,0,.,.,. 1/1:72,18,0,.,.,. 1/1:91,8,0,.,.,. 1/1:71,29,20,.,.,. 1/2:158,58,40,114,0,108 1/1:119,9,0,.,.,. 1/1:101,32,17,.,.,. 1/1:101,20,2,.,.,. 1/1:156,21,2,.,.,. 1/1:78,16,3,.,.,. 1/1:124,15,0,.,.,. 1/1:64,18,0,.,.,. 1/1:131,24,0,.,.,. 1/1:144,35,24,.,.,. 1/1:109,33,3,.,.,. 1/1:129,32,5,.,.,. 1/1:40,12,0,.,.,. 1/1:138,21,0,.,.,. 1/1:101,15,0,.,.,. ./.:. 1/1:149,45,0,.,.,. 1/1:127,21,0,.,.,.

This is awesome! First time Hail has been used for non-human analysis, I think. The reference genome looks fine to me, but there’s another problem…

Your VCF has some patterns that fall outside the spec, and Hail doesn’t currently support them. The VCF spec indicates that the individual elements of array format fields (like PL here) cannot be missing, but your data has PL fields like 143,27,0,.,.,. in which only some of the elements are missing. Would it be possible for you to fix these up with another tool before importing? I think the two best options are:

  1. make the entire field missing when this happens
  2. fill in the missing entries with a “zero value” like 999
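
For what it’s worth, here is a rough sketch of how either option could be done in plain Python before importing. It is only an illustration, not something Hail ships: the function name fix_partial_pl, its mode and fill parameters, and the assumption that the columns are tab-separated with PL listed in the FORMAT column are all made up for this example. It only touches PL entries where some values are present and others are “.”.

```python
def fix_partial_pl(line, mode="missing", fill=999):
    """Repair partially missing PL entries in a single VCF line.

    Hypothetical helper for illustration only.
    mode="missing": replace the whole PL entry with "." (option 1).
    mode="fill":    replace each "." inside PL with `fill` (option 2).
    """
    if mode not in ("missing", "fill"):
        raise ValueError("mode must be 'missing' or 'fill'")
    if line.startswith("#"):
        return line  # header lines pass through unchanged

    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 10:
        return line  # no FORMAT/sample columns to fix

    keys = cols[8].split(":")  # FORMAT column, e.g. GT:PL
    if "PL" not in keys:
        return line
    pl_idx = keys.index("PL")

    for i in range(9, len(cols)):  # sample columns
        parts = cols[i].split(":")
        if pl_idx >= len(parts):
            continue  # sample entry has no PL value at all
        values = parts[pl_idx].split(",")
        partially_missing = "." in values and any(v != "." for v in values)
        if not partially_missing:
            continue
        if mode == "missing":
            parts[pl_idx] = "."
        else:
            parts[pl_idx] = ",".join(str(fill) if v == "." else v for v in values)
        cols[i] = ":".join(parts)

    return "\t".join(cols) + newline


if __name__ == "__main__":
    example = "Chr01\t74\t.\tG\tC,A\t127\tLOWQUAL\tDP=309\tGT:PL\t1/1:130,36,0,.,.,.\t./.:."
    print(fix_partial_pl(example, mode="missing"))
    print(fix_partial_pl(example, mode="fill"))
```

To use it on a whole file you would read the VCF line by line, pass each line through the function, write the result to a new file, and import that file instead.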

Tim: Thanks for your interest and suggestions. It was indeed awesome that I was able to bring the sorghum reference genome into Hail with little effort.

I used bcftools to call the variants because I’m trying to replicate some previous results. For the purpose of working with Hail, however, I’m going to take a little detour and call the same samples using GATK and try importing the GATK-generated VCF file. That way I can make sure that Hail will work for me before I start messing with the VCF files.

That will probably be easiest, though it’s probably annoying to have to re-call the variants.

Another thing – is Sorghum diploid, as it looks from the VCF line in the error message? Hail doesn’t currently support ploidy > 2, though we’ll add this at some point.

Good progress: I generated a new VCF using GATK’s HaplotypeCaller and was able to import the resulting variants into a matrix table. There were a few multi-allelic loci which I had to eliminate before I was able to use the variant_qc method. From there I was able to complete all of the examples from the Quality Control section of the 01-genome-wide-association tutorial with my data. Nice!
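
The post doesn’t say how the multi-allelic loci were removed. One possible way, sketched here purely as an assumption, is to drop them from the VCF text before import by keeping only records whose ALT column holds exactly one allele. The names has_single_alt and drop_multiallelic are hypothetical, and the sketch assumes tab-separated VCF lines.

```python
def has_single_alt(line):
    """Return True if a VCF data line has exactly one ALT allele."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 5:
        return False  # not a well-formed data line
    alt = cols[4]  # ALT is the fifth VCF column
    return alt != "." and "," not in alt


def drop_multiallelic(lines):
    """Yield header lines and single-ALT records; skip everything else."""
    for line in lines:
        if line.startswith("#") or has_single_alt(line):
            yield line


if __name__ == "__main__":
    records = [
        "##fileformat=VCFv4.2\n",
        "Chr01\t74\t.\tG\tC,A\t127\t.\t.\n",  # two ALT alleles: dropped
        "Chr01\t80\t.\tT\tC\t50\t.\t.\n",     # one ALT allele: kept
    ]
    for kept in drop_multiallelic(records):
        print(kept, end="")
```

Inside Hail, filtering the rows of a matrix table (call it mt, a placeholder name) on hl.len(mt.alleles) == 2 should have a similar effect without touching the file.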

Sorghum is indeed diploid.

Excellent! Let us know if you run into any other problems.

We just hit VCFParseError: ploidy > 2 not supported when running Hail 0.2.40 on a VCF generated by GATK4 Mutect2. Is there any time frame for supporting VCFs with ploidy > 2? Filtering out those multi-allelic variants does not seem reasonable in a somatic calling scenario.

Your question is different from the earlier question in this thread, and it’s about Hail 0.2. Please make a new post in the Help [0.2] category.

Thanks for the suggestion; I just did, under the title “VCFParseError: ploidy > 2”.