Subsetting large data

Hey @kforth!

I’m sorry to hear (1) didn’t work! Can you share a bit more detail on what happened or a stack trace? Based on the code you shared, I would not expect memory issues. For most operations, Hail fastidiously avoids reading the whole partition into memory and instead streams through the partition. In those cases, partitioning mostly controls the amount of parallelism available to Hail, not the memory requirements.
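To illustrate what I mean about partitioning, here's a rough sketch (the path and partition counts are placeholders, not your actual data): increasing the partition count mainly gives Spark more tasks to run in parallel, it doesn't change how much memory any single streamed partition needs.

```python
import hail as hl

hl.init()

# Hypothetical path; substitute your own MatrixTable.
mt = hl.read_matrix_table('gs://my-bucket/large_dataset.mt')

print(mt.n_partitions())

# More partitions -> more parallel tasks, not lower per-task memory.
mt = mt.repartition(2000)        # shuffle-based repartition

# Cheaper alternative when only *reducing* the partition count:
# mt = mt.naive_coalesce(500)
```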

Based on the os.getenv call, I wonder: are you running this analysis on an on-prem cluster? In that case, you need to explicitly tell Java/Spark (a library on which Hail depends) how much memory to use. We have a post about how to do that: How do I increase the memory or RAM available to the JVM when I start Hail through Python?.
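For reference, a minimal sketch of what that post describes looks roughly like the following; the 8g values are placeholders for whatever your machine actually has, and the environment variable must be set before Hail is initialized:

```python
import os

# Set before importing/initializing Hail, otherwise the JVM has
# already started with its default memory limits.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-memory 8g '
    '--executor-memory 8g '
    'pyspark-shell'
)

import hail as hl
hl.init()
```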