Exporting Hail MT to VCF - FORMAT columns missing

Lnizz_b · March 21, 2025, 12:40pm

Hello,
I am new to Hail and am exporting a Hail MT to VCF using the All of Us code snippets. The code runs fine; however, the FORMAT columns are omitted despite the necessary entry fields being present in the MatrixTable. The documentation warns that this happens with Table objects, but I have confirmed that I am using a MatrixTable. I am reading the produced VCFs by concatenating the shards and viewing them with bcftools, rather than using hl.import_vcf, as I want to use bcftools for my downstream processing.

An excerpt of my code:

vcf_header = "FILEPATH/data/vcf_header.txt"
os.system("gsutil cat FILEPATH/data/vcf_header.txt")

##fileformat=VCFv4.2
##reference=gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
##FILTER=<ID=ExcessHet,Description="Site has excess het value larger than the threshold">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=NO_HQ_GENOTYPES,Description="Site has no high quality variant genotypes">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=FT,Number=1,Type=String,Description="Genotype Filter Field">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=homozygote_count,Number=R,Type=Integer,Description="Number of homozygotes per allele. One element per allele, including the reference.">

metadata = hl.get_vcf_metadata(vcf_header)
mt_vcf = mt_vcf.repartition(50, shuffle=True)
print(type(mt_vcf))
mt_vcf.describe()

<class 'hail.matrixtable.MatrixTable'>

Global fields:
    None

Column fields:
    's': str

Row fields:
    'locus': locus
    'alleles': array
    'filters': set
    'info': struct {
        AC: array,
        AF: array,
        AN: int32,
        homozygote_count: array
    }

Entry fields:
    'GQ': int32
    'GT': call
    'AD': array
    'RGQ': int32
    'FT': str
    'PS': int64

Column key: ['s']
Row key: ['locus', 'alleles']

out_vcf = f'{bucket}/data/fads_cluster.vcf.bgz'
hl.export_vcf(mt_vcf, out_vcf, parallel="header_per_shard", tabix=False, metadata=metadata)
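
To rule out the viewing step, it may help to read the header of a single shard directly, without bcftools. The sketch below is a minimal example under two assumptions: one shard has already been copied to local disk (the path `part-00000.bgz` in the comment is hypothetical), and the block-gzipped shard can be read with Python's standard `gzip` module. The helper names are illustrative and not part of Hail. It collects the header lines and reports whether any `##FORMAT` lines and a `FORMAT` column are present.

```python
import gzip


def read_vcf_header(path):
    """Return the header lines (those starting with '#') of a gzip/bgzip VCF."""
    header = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # first data record reached, so the header is complete
            header.append(line.rstrip("\n"))
    return header


def summarize_header(header):
    """Report whether FORMAT metadata and per-sample columns are present."""
    # IDs declared in ##FORMAT=<ID=...> lines, in file order.
    format_ids = [
        line.split("ID=", 1)[1].split(",", 1)[0]
        for line in header
        if line.startswith("##FORMAT=<ID=")
    ]
    # The column header line is tab-delimited in a real VCF.
    chrom_line = next((l for l in header if l.startswith("#CHROM")), "")
    columns = chrom_line.split("\t")
    has_format_column = "FORMAT" in columns
    # Eight fixed columns plus FORMAT precede the sample columns.
    n_samples = max(len(columns) - 9, 0) if has_format_column else 0
    return {
        "format_ids": format_ids,
        "has_format_column": has_format_column,
        "n_samples": n_samples,
    }


# Hypothetical usage on a local copy of one shard:
# header = read_vcf_header("part-00000.bgz")
# print(summarize_header(header))
```

If the summary lists the expected FORMAT IDs and a non-zero sample count, the export itself likely wrote the genotype columns and the loss happens later in the pipeline.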

ehigham · March 21, 2025, 7:40pm

Hi @Lnizz_b,
That's troubling. I'm sorry you're experiencing this issue. Which version of Hail are you using, and would you mind providing the header that Hail generated?
Thanks,

Lnizz_b · March 23, 2025, 9:46pm

Thanks for responding! The code generates a folder full of .bgz files, one per shard as expected.

The head of each .bgz file resembles the example below; I have omitted the several hundred contig lines and the file paths. It is also worth noting that I am only running this on a couple of hundred variants, which may be a factor. I plan to expand it to a larger set once this is working correctly.

As specified in the file, the Hail version is 0.2.130.post1-c69cd67afb8b.

Thanks!

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##hailversion=0.2.130.post1-c69cd67afb8b
##FILTER=<ID=ExcessHet,Description="Site has excess het value larger than the threshold">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=NO_HQ_GENOTYPES,Description="Site has no high quality variant genotypes">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=homozygote_count,Number=R,Type=Integer,Description="Number of homozygotes per allele. One element per allele, including the reference.">
##CONTIG LINES OMITTED
##bcftools_viewVersion=1.12+htslib-1.12
##bcftools_viewCommand=view -G FILEPATH/parrt_00-FILENAME.bgz; Date=Fri Mar 21 17:02:31 2025
#CHROM POS ID REF ALT QUAL FILTER INFO
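
One detail in this header may be worth checking: the recorded command is `bcftools view -G`, and `-G` is the short form of bcftools' `--drop-genotypes` option, which removes the per-sample genotype information from the output. If that is what produced this header, it may reflect the viewing step rather than what Hail wrote. The sketch below is a small hypothetical helper (the function name is illustrative and not part of Hail or bcftools) that scans header lines for recorded `bcftools view` commands and returns any that were run with that option. It assumes the `##bcftools_viewCommand=...; Date=...` layout seen above and only matches the flag when it appears as a separate token.

```python
def genotype_dropping_commands(header_lines):
    """Return recorded `bcftools view` commands that used -G / --drop-genotypes."""
    prefix = "##bcftools_viewCommand="
    flagged = []
    for line in header_lines:
        if not line.startswith(prefix):
            continue
        # Strip the trailing "; Date=..." part that follows the command itself.
        command = line[len(prefix):].split("; Date=", 1)[0]
        tokens = command.split()
        # Only standalone flags are checked; clustered short options
        # (for example "-HG") are not handled in this sketch.
        if "-G" in tokens or "--drop-genotypes" in tokens:
            flagged.append(command)
    return flagged


header = [
    "##fileformat=VCFv4.2",
    "##bcftools_viewCommand=view -G FILEPATH/parrt_00-FILENAME.bgz; Date=Fri Mar 21 17:02:31 2025",
]
print(genotype_dropping_commands(header))
```

Viewing the same shard without `-G` would show whether the FORMAT and sample columns are actually in the exported file.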

Lnizz_b · March 31, 2025, 4:42pm

Hi, to add to this: I also get the following error, although the file still writes out okay. (This is run on the All of Us Google services system.)

WARNING: An illegal reflective access operation has occurred (1 + 1) / 2]
WARNING: Illegal reflective access by org.apache.spark.util.SizeEstimator$ (file:/usr/lib/spark/jars/spark-core_2.12-3.3.0.jar) to field java.util.regex.Pattern.pattern
WARNING: Please consider reporting this to the maintainers of org.apache.spark.util.SizeEstimator$
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Traceback (most recent call last):=============================> (97 + 3) / 100]
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/opt/conda/lib/python3.10/site-packages/traitlets/config/application.py", line 1043, in launch_instance
    app.start()
  File "/opt/conda/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 736, in start
    self.io_loop.start()
  File "/opt/conda/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 195, in start
    self.asyncio_loop.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1894, in _run_once
    handle = self._ready.popleft()
IndexError: pop from an empty deque
