Filtering is really fast but show is really slow, why is that?

danking · June 19, 2020, 3:11pm

Hail only performs computation when you observe the output, such as with collect, show, or write. When you evaluate

mt_f1 = mt.filter(...)

Hail does not process the data. Instead, Hail is building a recipe of things that need to be done in order to give you the results you want. Conceptually, mt_f1 is:

1. Read the table from gs://gnomad-public…, call this mt
2. Filter mt by using the values from genes_as_hail_literal

It is only at the time that you execute:

mt_f1.show()

that Hail actually executes the recipe you’ve built. If you execute:

mt_f1.show()
mt_f1.show()

then Hail will run the entire recipe twice. Hail is designed this way because it operates on data that is too large to fit in memory. Hail is like a firehose: either the water is shooting out into something or it is off. There’s no way to pause the firehose halfway through a computation.
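To make the recipe idea concrete, here is a small pure-Python sketch. It is not Hail's implementation or API: LazyTable, read_rows, and the reads counter are all made-up names for illustration. It only mimics the behaviour described above, where filter records a step and returns right away, and every show re-runs the whole recipe from the read step.

```python
from itertools import islice


class LazyTable:
    """Toy stand-in for a lazily evaluated table (hypothetical, not Hail's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that produces the rows
        self._steps = tuple(steps)   # recorded filter predicates; nothing has run yet

    def filter(self, predicate):
        # Only extends the recipe, so this returns immediately.
        return LazyTable(self._source, self._steps + (predicate,))

    def show(self, n=3):
        # Observing the output runs the whole recipe, starting from the read.
        rows = self._source()
        for predicate in self._steps:
            rows = (row for row in rows if predicate(row))
        shown = list(islice(rows, n))
        print(shown)
        return shown


reads = {"count": 0}  # how many times the "read" step has actually run


def read_rows():
    # Stands in for the expensive read from cloud storage.
    reads["count"] += 1
    return iter(range(1_000_000))


mt = LazyTable(read_rows)
mt_f1 = mt.filter(lambda row: row % 250_000 == 0)  # fast: no rows are read here

mt_f1.show()  # first run of the recipe
mt_f1.show()  # second run: the data is read and filtered again
print("times the data was read:", reads["count"])
```

In this toy model the filter line costs nothing because it only appends to a tuple, while each show pays for the read and the filter again. That is the same pattern as a fast mt.filter(...) followed by a slow mt_f1.show().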
