My script is supposed to search for a list of variants in the gnomAD database using Hail, and retrieve only the matching variants whose exome_AF field is <= 1%. The bottleneck is the export step: I have tried exporting to a CSV file, to a VCF file, and even writing the result as a Hail Table, and they all take 2-3 hours. Any suggestions on how to optimize this? Keep in mind that the table to write has around 7,000 rows with 25 columns each.
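For reference, the filtering step boils down to something like this (simplified: the variant-list lookup is omitted, the path is a placeholder, and exome_AF stands for the allele-frequency field I match against):

import hail as hl

ht = hl.read_table('gnomad.ht')  # placeholder path to the gnomAD Hail Table
result = ht.filter(ht.exome_AF <= 0.01)  # keep rows with exome allele frequency <= 1%

And the export function that is slow: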
import time
import logging

def export_filtered_data(result, output_path):
    """Exports the filtered results to a Hail Table."""
    start_time = time.time()
    result.write(output_path, overwrite=True)  # write the filtered Table to disk
    logging.info(f"Results written to '{output_path}' in {time.time() - start_time:.2f} seconds")
Hi @Hanin_Omar,
I'm sorry that Hail seems to be taking an unreasonable amount of time to export the filtered data.
Hail is a lazy language and so I cannot say why exporting is taking so long without seeing the rest of your pipeline upstream of export_filtered_data. Can you share it or a sample of it?
Best,
gNomad.txt (7.1 KB)
Sure, I have uploaded the full code. I am planning to optimize and change it, so if you have any recommendations on handling Hail, I'd appreciate them.
Thank you
Hi @Hanin,
Thanks for sharing your code. A couple of things to note:
Your step timings don't actually represent the time it takes to perform any of those operations: they are all happening during the export.
Hail defers all computation until a write. This could partly explain why the write seems to take so long: all the earlier operations only appear fast because nothing has actually run yet.
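You can see this directly by timing the individual calls (a minimal sketch; the paths are placeholders and exome_AF is the field from your post):

import time
import hail as hl

ht = hl.read_table('gnomad.ht')  # placeholder path
t0 = time.time()
result = ht.filter(ht.exome_AF <= 0.01)  # only builds a query plan
print(f'filter: {time.time() - t0:.2f}s')  # near-instant: nothing has run yet

t0 = time.time()
result.write('filtered.ht', overwrite=True)  # the whole pipeline executes here
print(f'write: {time.time() - t0:.2f}s')  # this is where the 2-3 hours accumulate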
You've got a couple of duplicate if-then-else expressions. I don't think our optimiser is smart enough to fold those into one test, unfortunately. You could group them together like below:
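Something along these lines, where in_exomes and the AF/AN fields are placeholders for whatever your duplicated branches actually test and compute:

import hail as hl

# instead of repeating the same test once per annotation:
ht = ht.annotate(
    af=hl.if_else(ht.in_exomes, ht.exome_AF, ht.genome_AF),
    an=hl.if_else(ht.in_exomes, ht.exome_AN, ht.genome_AN),
)

# ...evaluate the test once and branch to a struct holding all the fields:
ht = ht.annotate(
    stats=hl.if_else(
        ht.in_exomes,
        hl.struct(af=ht.exome_AF, an=ht.exome_AN),
        hl.struct(af=ht.genome_AF, an=ht.genome_AN),
    )
)

The grouped fields are then available as ht.stats.af and ht.stats.an.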