Ld_prune() out of memory

rodrigo.barreiro · March 3, 2022, 1:04pm

Hello Hail team,

I’m trying to get the pruned variant table using ld_prune(), but every time I run it I get "Container killed on request. Exit code is 137". I’m using a matrix table of 1.1 TB (WGS of 1,300 individuals), and I’m working on an EMR cluster on AWS with EC2 instances that have 64 GB of memory and 256 GB of storage.

Am I doing something wrong? Is there an estimate of how much memory I need for this?

Thank you for the support!

Code

import hail as hl
hl.init()

sabe_mt = hl.read_matrix_table('s3://file.mt/')
sabe_mt = sabe_mt.filter_rows(hl.len(sabe_mt.alleles) == 2)
pruned_variant_table = hl.ld_prune(sabe_mt.GT, r2=0.1, bp_window_size=50000)

Spark Config (Memory)

spark.driver.memory    52131M
spark.executor.memory  51316M

Hail log (tail -1000)
ldprune_error.log (154.0 KB)

Python prompt
hail_prompt.txt (10.0 KB)

chrisvittal · March 3, 2022, 8:44pm

Thanks for reaching out. One clarifying question: how much memory per core is this?

rodrigo.barreiro · March 3, 2022, 8:59pm

Hello Chris,

I’m not sure how I can get that info, any tips?

Best,
Rodrigo

danking · March 3, 2022, 10:03pm

Hey @rodrigo.barreiro !

I think Chris is asking what kind of EC2 instances you are using.

rodrigo.barreiro · March 3, 2022, 10:41pm

Oh, I see. I’m using only m4.4xlarge instances (1 master, 2 core, 0 to 10 task nodes). They have 16 vCPUs and 64 GB of memory each.

chrisvittal · March 4, 2022, 12:44am

Okay, thanks for the info. To answer my own question: that is 4 GB per core. LD prune in Hail is a very memory-intensive operation. On Google Cloud, our users typically use high-memory machines, which have at least 6.5 GB per core.

I have two recommendations:

1. Increase the spark.executor.memory property to at least 90% of the available memory on the machine. For the m4.4xlarge instances that you were using, this would be around 59000M.
2. You should also use more memory-optimized instances. The instances that most closely match what our users on GCP use for LD prune are the r5d.2xlarge, with 8 cores, 64 GiB of memory, and a 300 GiB SSD.
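
As a rough illustration of the arithmetic behind these two recommendations, here is a minimal sketch. The helper names are hypothetical (they are not part of Hail or Spark), and the instance sizes are the ones quoted in this thread.

```python
# Hypothetical helpers; not part of Hail, Spark, or the AWS tooling.

def memory_per_core_gib(total_memory_gib: float, vcpus: int) -> float:
    """Memory available per vCPU, in GiB."""
    return total_memory_gib / vcpus


def executor_memory_setting(total_memory_gib: float, fraction: float = 0.9) -> str:
    """Format a fraction of machine memory as a Spark-style value such as '58982M'."""
    total_mib = total_memory_gib * 1024
    return f"{int(total_mib * fraction)}M"


# (memory in GiB, vCPUs) as quoted in this thread.
instances = {
    "m4.4xlarge": (64, 16),
    "r5d.2xlarge": (64, 8),
}

for name, (mem_gib, vcpus) in instances.items():
    per_core = memory_per_core_gib(mem_gib, vcpus)
    setting = executor_memory_setting(mem_gib)
    print(f"{name}: {per_core:.1f} GiB per core, spark.executor.memory ~ {setting}")
```

The resulting value could then be supplied through the cluster's Spark configuration or, as an assumption about your setup, through the `spark_conf` argument of `hl.init`. Whether the cluster manager actually grants that much memory per container depends on its own limits.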

I hope this helps.

rodrigo.barreiro · March 4, 2022, 1:06am

Thanks for the insights, @chrisvittal !

I’ll give it a try. Do the task nodes have to be like this too?

chrisvittal · March 4, 2022, 1:13am

The task nodes should be high memory as well. You may be able to run regular r5 instances rather than r5d ones.

rodrigo.barreiro · March 7, 2022, 1:43pm

@chrisvittal, it worked! Thank you.

One observation: when my cluster added more task nodes, the number of tasks running concurrently did not increase, but it did increase when I added more core nodes.

E.g. (setup: progress bar):
2 core (16 vCPUs) + 8 task nodes: [Stage X:==> (x + 32) / 1000]
2 core (16 vCPUs) + 16 task nodes: [Stage X:==> (x + 32) / 1000]
4 core (16 vCPUs) + 8 task nodes: [Stage X:==> (x + 64) / 1000]

Are the task nodes doing anything in this case? Is there any recommended core/task ratio for this analysis?

Thank you again,

Rodrigo Barreiro

danking · March 14, 2022, 3:15pm

We recommend a 3:1 partition to core ratio.
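
For illustration, a minimal sketch of what a 3:1 ratio implies for a given cluster. The function name and the example cluster size are hypothetical.

```python
# Hypothetical helper: target partition count for a partition-to-core ratio.

def target_partitions(worker_nodes: int, cores_per_node: int, ratio: int = 3) -> int:
    """Number of partitions that gives `ratio` partitions per worker core."""
    return worker_nodes * cores_per_node * ratio


# Example: 10 worker nodes with 8 cores each (an assumed cluster size).
print(target_partitions(worker_nodes=10, cores_per_node=8))
```

In Hail, the current partition count of a matrix table can be checked with `mt.n_partitions()` and changed with `mt.repartition(n)`. Whether repartitioning is worth the cost for a dataset of this size is a separate judgment call.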

I am as surprised as you are that doubling the number of workers did not double the number of active partitions. I would ask Amazon support for help.
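
One way to sanity-check the progress-bar numbers reported above is to compare the active task counts against the vCPU totals. The sketch below assumes 16 vCPUs per node for both core and task nodes, which is an assumption and not something confirmed in this thread. The function name is hypothetical.

```python
# Assumption: every node (core or task) has 16 vCPUs.
VCPUS_PER_NODE = 16

# (core nodes, task nodes, active tasks shown in the progress bar)
observations = [
    (2, 8, 32),
    (2, 16, 32),
    (4, 8, 64),
]


def matches_core_nodes_only(core_nodes: int, active_tasks: int) -> bool:
    """True if the active task count equals the vCPUs on the core nodes alone."""
    return active_tasks == core_nodes * VCPUS_PER_NODE


for core, task, active in observations:
    all_vcpus = (core + task) * VCPUS_PER_NODE
    print(
        f"{core} core + {task} task nodes: {active} active tasks, "
        f"{core * VCPUS_PER_NODE} core-node vCPUs, {all_vcpus} total vCPUs, "
        f"core-only match: {matches_core_nodes_only(core, active)}"
    )
```

Under that assumption, all three observations match the core-node vCPU count alone. That pattern would be consistent with the task nodes not running any tasks, but it does not explain why.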
