Filtering is really fast but show is really slow. Why is that?

Hail only performs computation when you observe the output, such as with collect, show, or write. When you evaluate

mt_f1 = mt.filter(...)

Hail does not process the data. Instead, Hail builds a recipe of the steps needed to produce the results you want. Conceptually, mt_f1 is:

  1. Read the table from gs://gnomad-public…, call this mt
  2. Filter mt by using the values from genes_as_hail_literal

It is only at the time that you execute:

mt_f1.show()

that Hail actually executes the recipe you’ve built. If you execute:

mt_f1.show()
mt_f1.show()

then Hail will run the entire recipe twice. Hail is designed this way because it operates on data that is too large to fit in memory, so it streams the data through the recipe rather than holding on to intermediate results. Hail is like a firehose: either the water is shooting out into something or it is off. There’s no way to pause the firehose halfway through a computation.
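To make the recipe idea concrete, here is a toy sketch in plain Python. It is not how Hail is implemented, and it does not use the Hail API: LazyTable, read_rows, and the gene names are all made up for illustration. It only shows the general pattern, where filter records a step and collect runs the whole recipe from the start each time it is called.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated table (hypothetical, not Hail's code)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that "reads" the data
        self._steps = tuple(steps)   # recorded predicates, not yet applied

    def filter(self, predicate):
        # Only records the step; no rows are read or filtered here.
        return LazyTable(self._source, self._steps + (predicate,))

    def collect(self):
        # Observing the output runs the whole recipe: read, then every filter.
        rows = self._source()
        for predicate in self._steps:
            rows = [row for row in rows if predicate(row)]
        return rows


reads = {"count": 0}  # counts how many times the "file" is read


def read_rows():
    # Hypothetical data source standing in for reading a table from storage.
    reads["count"] += 1
    return [{"gene": "BRCA1"}, {"gene": "TP53"}, {"gene": "TTN"}]


genes = {"BRCA1", "TTN"}  # made-up gene list for the example

table = LazyTable(read_rows)
filtered = table.filter(lambda row: row["gene"] in genes)
reads_after_filter = reads["count"]        # filter alone read nothing

first = filtered.collect()                 # runs the recipe once
second = filtered.collect()                # runs the recipe again from scratch
reads_after_two_collects = reads["count"]
```

In this sketch, filter returns immediately because it does no work, and each collect pays the full cost of reading and filtering. That is the same shape as a fast filter followed by a slow show.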
