Memory issue - java.lang.OutOfMemoryError: Java heap space when running linreg3()

Hello, I'm currently working on a variant dataset (~3M variants, input VDS ~1.5 GB on disk) and trying out the linreg() functionality in Hail. The pre-processing and QC steps go smoothly, and I'm also able to annotate the VDS with covariates (~50) and expression data.

When I try running this with a smaller variant test set (a few hundred variants) in local mode, it computes very quickly for both linreg() and linreg3(). However, running linreg() gene-by-gene for ~20,000 genes in cluster mode is infeasible (I tried it, and it's very slow), so I need to use linreg3(). linreg3() seems to scale well with the number of genes per block, which is great, but memory seems to be the main bottleneck when I run in cluster mode. I tried various gene block sizes, but I get the error:

Exception in thread "dispatcher-event-loop-(N)" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)

at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:253)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)

Ideally I want to load in at least 32 genes at a time, but even a gene block size of 10 gives me memory issues. Here are more details about my setup:

#SBATCH -N 6
#SBATCH -t 24:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 4

spark-start
echo $MASTER
spark-submit --total-executor-cores 72 --executor-memory 6G \
  --jars file:///tigress/BEE/spark_hdfs/hail/jars/hail-all-spark.jar \
  --py-files file:///tigress/BEE/spark_hdfs/hail/python/hail.zip \
  file:///tigress/BEE/RNAseq_hail/Scripts/eqtls/gtex/v8_consortium/(name_of_script.py)

As you can see, I have a pretty extensive setup with 6 nodes and 72 cores with 6G each, so I don't think I should be capped by memory, but I still run into this issue. It actually doesn't error until 30-50 loops in, and the output from each loop is only about 580 MB on disk. Here is the main code in each loop:

assoc = analysis_set.annotate_samples_table(expr_kt.select(list(gene_set) + ['ID']), root='sa.expr') \
    .linreg3(y_list, covariates=cov_list) \
    .filter_variants_expr('min(va.linreg.pval) > %s' % trans_pval_threshold, keep=False) \
    .write('file://' + out_dir + 'gene_chr' + str(chrom + 1) + '_part' + str(n + 1) + '.vds')

analysis_set is my VDS (persisted with 'DISK_ONLY' to save memory; I also tried without persist), and expr_kt is the expression key table (also persisted, and also tried without persist). assoc is overwritten each loop (but maybe copies of the old assoc are kept around without being GC'ed?).
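
For context, the surrounding loop is roughly shaped like this (a simplified sketch, not my exact script; gene_blocks, out_dir, chrom, and the way I build y_list are stand-ins for my real setup):

analysis_set = analysis_set.persist('DISK_ONLY')  # persisted once before the loop (also tried without)

for n, gene_set in enumerate(gene_blocks):  # gene_blocks = chunks of ~32 gene names for this chromosome
    y_list = ['sa.expr.%s' % g for g in gene_set]  # one phenotype expression per gene in the block
    assoc = analysis_set.annotate_samples_table(expr_kt.select(list(gene_set) + ['ID']), root='sa.expr') \
        .linreg3(y_list, covariates=cov_list) \
        .filter_variants_expr('min(va.linreg.pval) > %s' % trans_pval_threshold, keep=False)
    assoc.write('file://' + out_dir + 'gene_chr' + str(chrom + 1) + '_part' + str(n + 1) + '.vds')
    # assoc is simply rebound on the next iteration; nothing is explicitly unpersisted or freed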

This is how I set up my HailContext:

from pyspark.sql import SparkSession
from hail import *
spark = SparkSession.builder.appName("My Hail") \
    .config("spark.sql.files.openCostInBytes", 1099511627776) \
    .config("spark.sql.files.maxPartitionBytes", 1099511627776) \
    .config("spark.hadoop.parquet.block.size", 1099511627776) \
    .getOrCreate()
sc = spark.sparkContext
hc = HailContext(sc)

But I also tried:

from hail import *
hc = HailContext()

My intuition is that somewhere in the linreg3() function a large set of arrays (or ArrayLists) is created/copied, making it memory-inefficient? (I don't think it's the write() step, since each VDS output is only around 580 MB on disk, but I could be wrong.) Any help with configuration settings, or with how to run multiple linear regressions more efficiently (in terms of both speed and memory), would be appreciated.

Thanks!

B

Hi! A few questions to get started:

  1. How many samples do you have? Many operations in Hail have a memory component that scales linearly with the number of samples.

  2. How are you loading expression data onto the VDS? Are you doing annotate_samples_table with a table with 20K data points per sample? If so, this is the problem – we store the sample annotation table locally, and you can resolve the OOM issues by annotating a batch of genes at a time, using linreg3, and exporting.

Hi!

  1. In this example, I have ~650 samples

  2. This is my annotation step: annotate_samples_table(expr_kt.select(list(gene_set) + ['ID']), root='sa.expr')

expr_kt is 650 by 20,000, but it was my understanding that expr_kt.select(list(gene_set) + ['ID']) should return a key table with only the selected columns, i.e. 650 samples by just the genes in the block? (I set the length of gene_set to 32, and I also tried scaling it down to 10.)
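
Back-of-the-envelope (assuming the expression values are 8-byte doubles), even the full key table isn't that large, and a 32-gene block is tiny:

# rough size of the expression data, assuming 8-byte doubles
samples, genes, block = 650, 20000, 32
print(samples * genes * 8 / 1e6)  # ~104 MB for the full 650 x 20,000 table
print(samples * block * 8 / 1e3)  # ~166 KB for one 32-gene block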

Ah, yeah, you’re already doing what my suggestion would have been.

"It actually doesn't error until 30-50 loops"

I think you’re right that the problem is related to GC. I don’t know exactly where the issue could be, though – unless py4j is failing to GC java objects when their Python handles are GCed (I don’t think this is happening).

Something to try is not using Parquet, which has caused a lot of problems in 0.1 and is part of why we're moving away from it for the coming 0.2 version. VDS and KT files use Parquet, so I'd suggest using export_variants to write a text file inside the regression loop instead.
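
Something along these lines in place of the write() call (an untested sketch; swap in whichever fields you actually need):

# export only the regression results as a TSV instead of writing a full VDS (no Parquet involved)
assoc.export_variants('file://' + out_dir + 'gene_chr' + str(chrom + 1) + '_part' + str(n + 1) + '.tsv',
                      'variant = v, pval = va.linreg.pval, beta = va.linreg.beta, se = va.linreg.se')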

Anyone else have ideas?

I see, I think that's worth a shot. I'll circle back once I've given it a try. (Though it's worth noting that I've also tried converting the result with .variants_table().flatten().select(['va.rsid', 'va.qc.AF', 'va.linreg.ytx', 'va.linreg.beta', 'va.linreg.se', 'va.linreg.tstat', 'va.linreg.pval']).to_pandas() and writing the pandas DataFrame out as a pickle, still with memory issues.)

Thanks!

B

Hi,

I tried the export_variants() method and it actually runs quite a bit faster (probably because I'm only writing the 6-7 columns I want rather than the full VDS), and I was able to get through more variants, but it eventually produced the same error as before. (Before, I got through about 10% of the genes; now it's closer to 20%.)

Although this doesn't resolve the issue itself, I realize I misunderstood the --executor-memory setting - I thought it was executor memory per core (hence 6G), but since it's per executor, I can probably set it much higher and finish running the code without error.
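
Working through the numbers (a rough sketch; I'm assuming one 4-core executor per SLURM task, and 0.6 is just Spark's default spark.memory.fraction):

# per-task memory with the current submit line, roughly
executor_mem_gb = 6     # --executor-memory is per executor JVM, not per core
cores_per_exec  = 4     # matches --cpus-per-task
usable_fraction = 0.6   # approximate default spark.memory.fraction
print(executor_mem_gb * usable_fraction / cores_per_exec)  # ~0.9 GB usable per concurrent task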

But I think the issue itself remains - I wonder if there’s a way to check (without the web interface) how much memory is being used by each object?

Best,

Brian

It’s probably running faster because converting to pandas is very slow even for small amounts of data, and the vds.write was writing genotypes unnecessarily.

Spark seems to be a memory hog, so 6G/4 cores might not be enough. Hopefully upping that solves the problem! We have a lot of trouble debugging Spark OOMs here as well.

Let us know if the problem doesn’t resolve after resetting the config!