Hi @smcnulty,
I’m sorry you’re having trouble with Hail!
Regarding speed, Hail is lazy, so it only executes your pipeline when you observe the output, for example with `write`, `show`, or `collect`. All the dataset annotation is done by the `write` step.
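As a minimal sketch (the table path and fields here are hypothetical, not from your pipeline), the `annotate` call only builds up a query plan; nothing runs until `write`:

```python
import hail as hl

hl.init()

# Lazy: these lines only construct the pipeline, no data is read yet.
ht = hl.read_table('gs://my-bucket/variants.ht')  # hypothetical path
ht = ht.annotate(is_common=ht.AF > 0.05)          # hypothetical field

# This is the step that actually executes everything above.
ht.write('gs://my-bucket/variants_annotated.ht')
```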
Hail already reads as little of the gnomAD data as possible, so filtering based on another table won't improve on what Hail is doing. I'm not sure why that causes you to run out of memory; hopefully someone from the compiler team can comment on that.
Regarding running this faster, what is your compute environment? It appears that you're using a laptop. Have you tried setting `PYSPARK_SUBMIT_ARGS` to use all the memory on your laptop?
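For example, a sketch assuming you have roughly 8 GB to spare (adjust the number for your machine); the environment variable must be set before importing Hail so Spark picks it up at startup:

```python
import os

# Must be set before `import hail`; the trailing 'pyspark-shell' is required.
# 8g is an assumption -- use whatever memory your laptop can spare.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 8g pyspark-shell'

import hail as hl
hl.init()
```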