Subsetting large data

Hey @kforth!

I’m sorry to hear (1) didn’t work! Can you share a bit more detail on what happened or a stack trace? Based on the code you shared, I would not expect memory issues. For most operations, Hail fastidiously avoids reading the whole partition into memory and instead streams through the partition. In those cases, partitioning mostly controls the amount of parallelism available to Hail, not the memory requirements.
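To illustrate what I mean about partitioning, here's a rough sketch (the path and partition counts are placeholders, not your actual data): increasing the partition count mainly gives Spark more tasks to run in parallel, it doesn't change how much memory any single streamed partition needs.

```python
import hail as hl

hl.init()

# Hypothetical path; substitute your own MatrixTable.
mt = hl.read_matrix_table('gs://my-bucket/large_dataset.mt')

print(mt.n_partitions())

# More partitions -> more parallel tasks, not lower per-task memory.
mt = mt.repartition(2000)        # shuffle-based repartition

# Cheaper alternative when only *reducing* the partition count:
# mt = mt.naive_coalesce(500)
```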

Based on the os.getenv call, I wonder: are you running this analysis on an on-prem cluster? In that case, you need to explicitly tell Java/Spark (a library on which Hail depends) how much memory to use. We have a post about how to do that: How do I increase the memory or RAM available to the JVM when I start Hail through Python?.
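For reference, a minimal sketch of what that post describes looks roughly like the following; the 8g values are placeholders for whatever your machine actually has, and the environment variable must be set before Hail is initialized:

```python
import os

# Set before importing/initializing Hail, otherwise the JVM has
# already started with its default memory limits.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-memory 8g '
    '--executor-memory 8g '
    'pyspark-shell'
)

import hail as hl
hl.init()
```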