My script is supposed to search for a list of variants in the gnomAD database using Hail, and retrieve only the matching variants whose exome_AF field is <= 1%. The bottleneck is the export step: I have tried exporting to a CSV file, to a VCF file, and even writing the result as a Hail Table, and they all take 2-3 hours. Any suggestions on how to optimize this? Keep in mind that the table to write has around 7,000 rows with 25 columns each.
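For reference, the filtering step boils down to something like this (simplified: the variant-list lookup is omitted, the path is a placeholder, and exome_AF stands for the allele-frequency field I match against):

import hail as hl

ht = hl.read_table('gnomad.ht')  # placeholder path to the gnomAD Hail Table
result = ht.filter(ht.exome_AF <= 0.01)  # keep rows with exome allele frequency <= 1%

And the export function that is slow: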
import time
import logging

def export_filtered_data(result, output_path):
    """Exports the filtered results to a Hail Table."""
    start_time = time.time()
    result.write(output_path, overwrite=True)  # write the filtered Table to disk
    logging.info(f"Results written to '{output_path}' in {time.time() - start_time:.2f} seconds")
Hi @Hanin_Omar,
I'm sorry that Hail seems to be taking an unreasonable amount of time to export the filtered data.
Hail is a lazy language and so I cannot say why exporting is taking so long without seeing the rest of your pipeline upstream of export_filtered_data. Can you share it or a sample of it?
Best,
gNomad.txt (7.1 KB)
Sure, I have uploaded the full code. I am planning to optimize and change it, so if you have any recommendations on handling Hail, I'd appreciate them.
Thank you
Hi @Hanin,
Thanks for sharing your code. A couple of things to note:
Your step timings don't actually represent the time it takes to perform any of those operations: they are all happening during the export.
Hail defers all computation until a write. This could partly explain why the write seems to take so long: all the earlier operations only appear fast because nothing has actually run yet.
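You can see this directly by timing the individual calls (a minimal sketch; the paths are placeholders and exome_AF is the field from your post):

import time
import hail as hl

ht = hl.read_table('gnomad.ht')  # placeholder path
t0 = time.time()
result = ht.filter(ht.exome_AF <= 0.01)  # only builds a query plan
print(f'filter: {time.time() - t0:.2f}s')  # near-instant: nothing has run yet

t0 = time.time()
result.write('filtered.ht', overwrite=True)  # the whole pipeline executes here
print(f'write: {time.time() - t0:.2f}s')  # this is where the 2-3 hours accumulate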
You've got a couple of duplicate if-then-else expressions. I don't think our optimiser is smart enough to fold those into one test, unfortunately. You could group them together like below:
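Something along these lines, where in_exomes and the AF/AN fields are placeholders for whatever your duplicated branches actually test and compute:

import hail as hl

# instead of repeating the same test once per annotation:
ht = ht.annotate(
    af=hl.if_else(ht.in_exomes, ht.exome_AF, ht.genome_AF),
    an=hl.if_else(ht.in_exomes, ht.exome_AN, ht.genome_AN),
)

# ...evaluate the test once and branch to a struct holding all the fields:
ht = ht.annotate(
    stats=hl.if_else(
        ht.in_exomes,
        hl.struct(af=ht.exome_AF, an=ht.exome_AN),
        hl.struct(af=ht.genome_AF, an=ht.genome_AN),
    )
)

The grouped fields are then available as ht.stats.af and ht.stats.an.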