Hello, I'm currently working on a variant dataset (~3M variants; input vds ~1.5G on disk) and trying out the linreg() functionality in Hail. The pre-processing and QC steps go smoothly, and I'm also able to annotate the vds with covariates (~50) and expression data.
With a smaller test set (a few hundred variants) in local mode, both linreg() and linreg3() compute very quickly. However, running linreg() over ~20,000 genes in cluster mode is infeasible (I tried it, and it's very slow), so I need to use linreg3(). Speed scales well with linreg3(), which is great, but memory seems to be the main bottleneck when I run in cluster mode: I tried various gene block sizes, but I always hit this error:
Exception in thread "dispatcher-event-loop-(N)" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
…
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:253)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
Ideally I want to load in at least 32 genes at a time, but even a gene block size of 10 gives me memory issues. Here are more details about my setup:
#SBATCH -N 6
#SBATCH -t 24:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 4
spark-start
echo $MASTER
spark-submit --total-executor-cores 72 --executor-memory 6G \
  --jars file:///tigress/BEE/spark_hdfs/hail/jars/hail-all-spark.jar \
  --py-files file:///tigress/BEE/spark_hdfs/hail/python/hail.zip \
  file:///tigress/BEE/RNAseq_hail/Scripts/eqtls/gtex/v8_consortium/(name_of_script.py)
As you can see, I have a fairly extensive setup: 6 nodes and 72 total cores, with 6G of memory per executor, so I don't think I should be capped by memory, but I still run into this issue. It actually doesn't error until 30-50 loops in, with the output from each loop being about 580M. Here is my main code in each loop:
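To double-check myself, here is the memory arithmetic for this setup as I understand it (a sanity-check sketch only; it assumes Spark launches one executor per SLURM task, which may not match what spark-start actually does, and note that --executor-memory is per executor, not per core):

```python
# Assumed mapping from the SLURM/spark-submit settings above to Spark resources.
nodes = 6                 # #SBATCH -N 6
executors_per_node = 3    # #SBATCH --ntasks-per-node 3 (assumption: 1 executor per task)
cores_per_executor = 4    # #SBATCH --cpus-per-task 4
executor_memory_gb = 6    # --executor-memory 6G (per executor, NOT per core)

executors = nodes * executors_per_node               # 18 executors
total_cores = executors * cores_per_executor         # 72, matching --total-executor-cores
total_memory_gb = executors * executor_memory_gb     # 108G of executor heap cluster-wide
memory_per_core_gb = total_memory_gb / total_cores   # 1.5G of heap per concurrent task

print(executors, total_cores, total_memory_gb, memory_per_core_gb)
```

So each concurrent task only has about 1.5G of heap to work with, which is less generous than "72 cores with 6G each" sounds.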
assoc = (analysis_set
    .annotate_samples_table(expr_kt.select(list(gene_set) + ['ID']), root='sa.expr')
    .linreg3(y_list, covariates=cov_list)
    .filter_variants_expr('min(va.linreg.pval) > %s' % trans_pval_threshold, keep=False))
assoc.write('file://' + out_dir + 'gene_chr' + str(chrom + 1) + '_part' + str(n + 1) + '.vds')
analysis_set is my vds (persisted with 'DISK_ONLY' to save memory; I also tried without persisting), and expr_kt is the expression data (also persisted; likewise tried without). assoc is overwritten each loop (but maybe copies of old assoc are kept around without being GC'ed?).
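For context, gene_set in each loop comes from chunking the full gene list into fixed-size blocks, roughly like this (a simplified sketch; gene_blocks is an illustrative helper, not part of my actual pipeline):

```python
def gene_blocks(genes, block_size=32):
    """Yield consecutive blocks of at most block_size genes.

    block_size=32 is the target from above; smaller values (e.g. 10)
    trade more loop iterations for a smaller per-call memory footprint.
    """
    for i in range(0, len(genes), block_size):
        yield genes[i:i + block_size]

# Example: 100 genes with block_size=32 -> blocks of 32, 32, 32, 4.
genes = ['gene%d' % i for i in range(100)]
blocks = list(gene_blocks(genes, 32))
print([len(b) for b in blocks])
```

Each block then becomes the gene_set / y_list for one linreg3() call.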
This is how I set up my HailContext:
from pyspark.sql import SparkSession
from hail import *
spark = (SparkSession.builder.appName("My Hail")
    .config("spark.sql.files.openCostInBytes", 1099511627776)
    .config("spark.sql.files.maxPartitionBytes", 1099511627776)
    .config("spark.hadoop.parquet.block.size", 1099511627776)
    .getOrCreate())
sc = spark.sparkContext
hc = HailContext(sc)
But I also tried:
from hail import *
hc = HailContext()
My intuition is that somewhere inside linreg3(), a large set of arrays (or ArrayLists) gets created or copied, making it memory inefficient. (I don't think it's the write() step, since each output vds is only around 580M on disk, but I could be wrong about that too.) Any help with configuration settings, or with running multiple linear regressions more efficiently in both speed and memory, would be appreciated.
Thanks!
B