Hello, I'm currently working on a variant dataset (~3M variants; input vds ~1.5G on disk) and trying out the linreg() functionality in Hail. The pre-processing and QC steps go smoothly, and I'm also able to annotate the vds with covariates (~50) and expression data.
With a smaller test set (a few hundred variants) in local mode, both linreg() and linreg3() compute very quickly. However, running linreg() over ~20,000 genes in cluster mode is infeasible (I tried it, and it's very slow), so I need to use linreg3(). Speed scales well with linreg3(), which is great, but memory seems to be the main bottleneck when I run in cluster mode: I tried various gene block sizes, but I always hit this error:
Exception in thread "dispatcher-event-loop-(N)" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
…
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:253)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
Ideally I want to load in at least 32 genes at a time, but even a gene block size of 10 gives me memory issues. Here are more details about my setup:
#SBATCH -N 6
#SBATCH -t 24:00:00
#SBATCH --ntasks-per-node 3
#SBATCH --cpus-per-task 4
spark-start
echo $MASTER
spark-submit --total-executor-cores 72 --executor-memory 6G \
  --jars file:///tigress/BEE/spark_hdfs/hail/jars/hail-all-spark.jar \
  --py-files file:///tigress/BEE/spark_hdfs/hail/python/hail.zip \
  file:///tigress/BEE/RNAseq_hail/Scripts/eqtls/gtex/v8_consortium/(name_of_script.py)
As you can see, I have a fairly extensive setup: 6 nodes and 72 total cores, with 6G of memory per executor, so I don't think I should be capped by memory, but I still run into this issue. It actually doesn't error until 30-50 loops in, with the output from each loop being about 580M. Here is my main code in each loop:
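To double-check myself, here is the memory arithmetic for this setup as I understand it (a sanity-check sketch only; it assumes Spark launches one executor per SLURM task, which may not match what spark-start actually does, and note that --executor-memory is per executor, not per core):

```python
# Assumed mapping from the SLURM/spark-submit settings above to Spark resources.
nodes = 6                 # #SBATCH -N 6
executors_per_node = 3    # #SBATCH --ntasks-per-node 3 (assumption: 1 executor per task)
cores_per_executor = 4    # #SBATCH --cpus-per-task 4
executor_memory_gb = 6    # --executor-memory 6G (per executor, NOT per core)

executors = nodes * executors_per_node               # 18 executors
total_cores = executors * cores_per_executor         # 72, matching --total-executor-cores
total_memory_gb = executors * executor_memory_gb     # 108G of executor heap cluster-wide
memory_per_core_gb = total_memory_gb / total_cores   # 1.5G of heap per concurrent task

print(executors, total_cores, total_memory_gb, memory_per_core_gb)
```

So each concurrent task only has about 1.5G of heap to work with, which is less generous than "72 cores with 6G each" sounds.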
assoc = (analysis_set
    .annotate_samples_table(expr_kt.select(list(gene_set) + ['ID']), root='sa.expr')
    .linreg3(y_list, covariates=cov_list)
    .filter_variants_expr('min(va.linreg.pval) > %s' % trans_pval_threshold, keep=False))
assoc.write('file://' + out_dir + 'gene_chr' + str(chrom + 1) + '_part' + str(n + 1) + '.vds')
analysis_set is my vds (persisted with 'DISK_ONLY' to save memory; I also tried without persisting), and expr_kt is the expression data (also persisted; likewise tried without). assoc is overwritten each loop (but maybe copies of old assoc are kept around without being GC'ed?).
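For context, gene_set in each loop comes from chunking the full gene list into fixed-size blocks, roughly like this (a simplified sketch; gene_blocks is an illustrative helper, not part of my actual pipeline):

```python
def gene_blocks(genes, block_size=32):
    """Yield consecutive blocks of at most block_size genes.

    block_size=32 is the target from above; smaller values (e.g. 10)
    trade more loop iterations for a smaller per-call memory footprint.
    """
    for i in range(0, len(genes), block_size):
        yield genes[i:i + block_size]

# Example: 100 genes with block_size=32 -> blocks of 32, 32, 32, 4.
genes = ['gene%d' % i for i in range(100)]
blocks = list(gene_blocks(genes, 32))
print([len(b) for b in blocks])
```

Each block then becomes the gene_set / y_list for one linreg3() call.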
This is how I set up my HailContext:
from pyspark.sql import SparkSession
from hail import *
spark = (SparkSession.builder.appName("My Hail")
    .config("spark.sql.files.openCostInBytes", 1099511627776)
    .config("spark.sql.files.maxPartitionBytes", 1099511627776)
    .config("spark.hadoop.parquet.block.size", 1099511627776)
    .getOrCreate())
sc = spark.sparkContext
hc = HailContext(sc)
But I also tried:
from hail import *
hc = HailContext()
My intuition is that somewhere inside linreg3(), a large set of arrays (or ArrayLists) gets created or copied, making it memory inefficient. (I don't think it's the write() step, since each output vds is only around 580M on disk, but I could be wrong about that too.) Any help with configuration settings, or with running multiple linear regressions more efficiently in both speed and memory, would be appreciated.
Thanks!
B