import_vcf reports error: fields in 'call_fields' must have 'Number' equal to 1

I have over 20,000 VCF shards (all contain the same samples; they were split by variant).
I am trying to use Hail to import these VCFs and write them to one big MatrixTable (MT). The files are fileformat=VCFv4.2.

hl.import_vcf('gs://path/WGS.*.vcf.gz', force=True).write('gs://path/step1/', overwrite=True)
File "", line 2, in import_vcf
Hail version: 0.2.96-39909e0a396f
Error summary: HailException: fields in 'call_fields' must have 'Number' equal to 1.

The FORMAT definitions from one of my VCF headers:

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">

I am confused about why Hail complains that fields in 'call_fields' must have 'Number' equal to 1. How should I parse these fields to solve this?

Many thanks in advance, Shuang

Do you know why PGT is Number=.? That implies that you have an array of PGT values. That doesn’t make sense; there should be just one PGT (just like there’s one GT). What would it mean if a sample had two phased genotypes at the same site?
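
For context, and as far as I know, import_vcf parses the fields named in its call_fields parameter as calls, and that parameter defaults to ['PGT'], which would explain why the error mentions call_fields even though you never passed it. Here is a minimal sketch in plain Python for checking which Number a header declares for each FORMAT field. The helper names and the regex are my own illustration, not part of Hail, and the regex assumes ID comes first in each FORMAT line, as it does in your header.

```python
import re

# Assumption: FORMAT lines look like ##FORMAT=<ID=...,Number=...,Type=...,...>
FORMAT_LINE = re.compile(r"^##FORMAT=<ID=([^,]+),Number=([^,]+),")


def format_numbers(header_text):
    """Map each FORMAT field ID to the Number value declared in the header."""
    numbers = {}
    for line in header_text.splitlines():
        match = FORMAT_LINE.match(line)
        if match:
            numbers[match.group(1)] = match.group(2)
    return numbers


def non_scalar_call_fields(header_text, call_fields=("PGT",)):
    """Return the call fields that are declared with a Number other than 1."""
    numbers = format_numbers(header_text)
    return [f for f in call_fields if f in numbers and numbers[f] != "1"]
```

Calling non_scalar_call_fields on the text of one shard's header should list the fields that need attention before import.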

If that field should actually be Number=1, you can modify the VCF files to have the right header. You can also use the header_file parameter to specify an override header to be used for all VCF files. That way you only have to change one file rather than many.
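
A rough sketch of the header override route, assuming the PGT FORMAT line starts with ID=PGT,Number=... as is conventional. The function names and the header path are hypothetical; only import_vcf, its force and header_file parameters, and write come from Hail.

```python
import re


def set_pgt_number_to_one(header_text):
    """Return the header text with the FORMAT PGT line declared as Number=1."""
    # Assumption: the line starts with ##FORMAT=<ID=PGT,Number=<something>,
    pattern = re.compile(r"^(##FORMAT=<ID=PGT,Number=)[^,]+", flags=re.MULTILINE)
    return pattern.sub(r"\g<1>1", header_text)


def import_with_override(vcf_glob, header_path, out_path):
    """Import all shards using one shared, corrected header file."""
    import hail as hl  # imported here so the helper above runs without Hail

    mt = hl.import_vcf(vcf_glob, force=True, header_file=header_path)
    mt.write(out_path, overwrite=True)
```

You would write the output of set_pgt_number_to_one to a file and then call something like import_with_override('gs://path/WGS.*.vcf.gz', 'fixed_header.vcf', 'gs://path/step1/'), where fixed_header.vcf is a placeholder name.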

Hi @danking,
Thank you very much. I edited the header to change Number=. to Number=1 for the PGT field, and used the header_file parameter to pass one edited header file for all my VCFs.
Hail now imports the VCFs without complaint.

However, while double-checking, I noticed one small issue.
I used bcftools view -h to extract the header and then edited it. Although I only needed to change the FORMAT definitions, the header file contains all the meta-information, including processing records such as ##GATKCommandLine=<ID=ApplyVQSR,CommandLine="ApplyVQSR …, (lots of info, omitted here). These records contain processing times, file names, etc., which differ among the VCFs. If I pass one header file for all 20,000 VCFs, will that cause a problem? If I remove the 4 ##GATKCommandLine lines, all VCFs share the same header. I assume these 4 lines are meaningless to Hail's import_vcf(), right?

Yeah, Hail doesn’t care about ##GATKCommandLine lines.
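
If you want to drop those lines from the shared header anyway, or confirm that two shards' headers agree once they are removed, a small sketch in plain Python could look like this. The function names are hypothetical, and it only assumes the lines start with ##GATKCommandLine as in your example.

```python
def strip_gatk_command_lines(header_text):
    """Drop ##GATKCommandLine records, keeping every other header line."""
    kept = [
        line
        for line in header_text.splitlines()
        if not line.startswith("##GATKCommandLine")
    ]
    return "\n".join(kept) + "\n"


def headers_match(header_a, header_b):
    """Compare two headers while ignoring their ##GATKCommandLine records."""
    return strip_gatk_command_lines(header_a) == strip_gatk_command_lines(header_b)
```

Running headers_match on a few shard headers is a cheap check that one override header really does describe all of them.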

Hi @danking, thanks a lot for your reply! Problem solved! :smiley: