Filtering is really fast but show is really slow. Why is that?

Yes, it worked, thanks! Last question before closing this: is there any reason why, the first time I call show() or any other method on ht_filt, it prints the output (although it takes quite a while), but the second time it gets stuck and I need to restart the kernel and rerun the code?

Hail only performs computation when you observe the output, such as with collect, show, or write, and it does so every time you ask. When you evaluate

mt_f1 = mt.filter(...)

Hail does not process the data. Instead, Hail is building a recipe of things that need to be done in order to give you the results you want. Conceptually, mt_f1 is:

  1. Read the table from gs://gnomad-public…, call this mt
  2. Filter mt by using the values from genes_as_hail_literal

It is only at the time that you execute:

mt_f1.show()

that Hail actually executes the recipe you’ve built. If you execute:

mt_f1.show()
mt_f1.show()

Then Hail will run the entire recipe twice. Hail is designed this way because it operates on data that is too large to fit in memory. Hail is like a firehose: either the water is shooting out into something or it is off. There’s no way to pause the firehose half-way through a computation.
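To make the "recipe" idea concrete, here is a minimal pure-Python sketch of lazy evaluation. It is only a toy analogy, not how Hail is implemented: the class LazyTable, the function read_source, and the list source_reads are all hypothetical names invented for this illustration. The point is that filter only extends the recipe, while every collect reruns it from the source.

```python
from typing import Callable, Iterable, List, Optional


class LazyTable:
    """Toy stand-in for a lazily evaluated table (hypothetical, not Hail's API)."""

    def __init__(
        self,
        source: Callable[[], Iterable[int]],
        steps: Optional[List[Callable[[int], bool]]] = None,
    ) -> None:
        self._source = source            # how to read the rows (not called yet)
        self._steps = list(steps or [])  # the recipe: predicates to apply in order

    def filter(self, predicate: Callable[[int], bool]) -> "LazyTable":
        # Only extends the recipe; no rows are read here.
        return LazyTable(self._source, self._steps + [predicate])

    def collect(self) -> List[int]:
        # Observing the output runs the whole recipe, starting from the source.
        rows: Iterable[int] = self._source()
        for predicate in self._steps:
            rows = filter(predicate, rows)
        return list(rows)


source_reads: List[str] = []  # records each time the source is actually read


def read_source() -> Iterable[int]:
    source_reads.append("read")
    return range(10)


evens = LazyTable(read_source).filter(lambda row: row % 2 == 0)
reads_after_filter = len(source_reads)        # 0: building the recipe reads nothing

first = evens.collect()                       # runs the recipe once
second = evens.collect()                      # runs the entire recipe again
reads_after_two_collects = len(source_reads)  # 2: one full run per collect
```

In this toy, keeping the result of the first collect in a variable avoids the second run. With Hail, the comparable move for data too large to hold in memory would presumably be to write the result out once and read it back, rather than observing the same recipe repeatedly.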

Thank you danking! So basically, I should always subset the table at the beginning with head(), keep scripting and checking that my script does what I want, and only later run it on the full table (a basic scripting rule which also applies here, of course!).

PS. I have now been watching your tutorials on Hail and I understand it better! But your explanation here is actually very useful, and I could not get it from the tutorials. It would be really nice if you could record a short video discussing basic Hail scripting (how to convert a question into a chain of Hail syntax), as it is slightly different from what we are used to doing in R or Python.