Hail only performs computation when you observe the output, for example with collect, show, or write. When you evaluate
mt_f1 = mt.filter(...)
Hail does not process the data. Instead, Hail builds a recipe of the steps needed to give you the results you want. Conceptually, mt_f1 is:
- Read the table from gs://gnomad-public…, call this mt
- Filter mt by using the values from genes_as_hail_literal
It is only when you execute:
mt_f1.show()
that Hail actually executes the recipe you’ve built. If you execute:
mt_f1.show()
mt_f1.show()
Then Hail will run the entire recipe twice. Hail is designed this way because it operates on data that is too large to fit in memory. Hail is like a firehose: either the water is shooting out into something or it is off. There's no way to pause the firehose halfway through a computation.
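To make the recipe idea concrete, here is a minimal sketch in plain Python. It is not Hail code and does not reflect how Hail is implemented; LazyTable, read_rows, and the gene names are hypothetical stand-ins chosen for illustration. It only shows the behaviour described above: filter records a step without touching any data, and each call that observes the output runs the whole recipe again from the start.

```python
class LazyTable:
    """Toy stand-in for a Hail table: it holds a recipe, not data."""

    def __init__(self, source, steps=()):
        self.source = source        # zero-argument callable that "reads" the rows
        self.steps = tuple(steps)   # predicates recorded by filter()

    def filter(self, predicate):
        # Only extends the recipe; no rows are read or processed here.
        return LazyTable(self.source, self.steps + (predicate,))

    def collect(self):
        # Observing the output runs the whole recipe from the beginning.
        rows = self.source()
        for predicate in self.steps:
            rows = [row for row in rows if predicate(row)]
        return rows


read_count = 0  # how many times the source has been read


def read_rows():
    """Hypothetical data source; counts every read."""
    global read_count
    read_count += 1
    return [{"gene": "BRCA1"}, {"gene": "TP53"}, {"gene": "EGFR"}]


genes = {"BRCA1", "EGFR"}  # made-up values standing in for genes_as_hail_literal

mt = LazyTable(read_rows)
mt_f1 = mt.filter(lambda row: row["gene"] in genes)
reads_after_filter = read_count  # nothing has been read yet

first = mt_f1.collect()   # runs the recipe: read, then filter
second = mt_f1.collect()  # runs the recipe again from scratch
```

In this toy model, read_count is still 0 after filter and is 2 after the two collect calls, which mirrors running mt_f1.show() twice.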