Error summary: OutOfMemoryError: Java heap space

I am hitting a memory issue on a machine with 480GB of RAM and 64 cores.

Here is my command:

mt = hl.import_vcf('/mnt/shared/garvan/marpin/MGRB_phase2_SNPtier12_match_vqsr_minrep_locusannot_WGStier12_unrelated_filteredhomo_hetero.vcf.bgz').write('/mnt/ceph/mt-sinai/WGS_hail/MGRB.mt', overwrite=True)

Error:

[Stage 1:> (0 + 67) / 32390]
[Stage 1:> (9 + 64) / 32390]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "</home/ubuntu/anaconda3/envs/hail/lib/python3.6/site-packages/decorator.py:decorator-gen-1008>", line 2, in write
File "/home/ubuntu/anaconda3/envs/hail/lib/python3.6/site-packages/hail/typecheck/check.py", line 585, in wrapper
return original_func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/hail/lib/python3.6/site-packages/hail/matrixtable.py", line 2500, in write
Env.backend().execute(MatrixWrite(self._mir, writer))
File "/home/ubuntu/anaconda3/envs/hail/lib/python3.6/site-packages/hail/backend/backend.py", line 108, in execute
result = json.loads(Env.hc()._jhc.backend().executeJSON(self._to_java_ir(ir)))
File "/home/ubuntu/anaconda3/envs/hail/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/ubuntu/anaconda3/envs/hail/lib/python3.6/site-packages/hail/utils/java.py", line 221, in deco
'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: OutOfMemoryError: Java heap space

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in stage 1.0 failed 1 times, most recent failure: Lost task 49.0 in stage 1.0 (TID 50, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:335)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:795)
at org.apache.hadoop.io.Text.decode(Text.java:412)
at org.apache.hadoop.io.Text.decode(Text.java:389)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at org.apache.spark.SparkContext$$anonfun$textFile$1$$anonfun$apply$11.apply(SparkContext.scala:831)
at org.apache.spark.SparkContext$$anonfun$textFile$1$$anonfun$apply$11.apply(SparkContext.scala:831)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:445)
at is.hail.io.vcf.LoadVCF$$anonfun$parseLines$1$$anon$1.hasNext(LoadVCF.scala:1268)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:66)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:38)
at is.hail.utils.package$.using(package.scala:596)
at is.hail.rvd.RVDPartitionInfo$.apply(RVDPartitionInfo.scala:38)
at is.hail.rvd.RVD$$anonfun$39.apply(RVD.scala:1295)
at is.hail.rvd.RVD$$anonfun$39.apply(RVD.scala:1293)
at is.hail.sparkextras.ContextRDD$$anonfun$cmapPartitionsWithIndex$1$$anonfun$apply$32.apply(ContextRDD.scala:422)
at is.hail.sparkextras.ContextRDD$$anonfun$cmapPartitionsWithIndex$1$$anonfun$apply$32.apply(ContextRDD.scala:422)
at is.hail.sparkextras.ContextRDD$$anonfun$run$1$$anonfun$apply$8.apply(ContextRDD.scala:192)
at is.hail.sparkextras.ContextRDD$$anonfun$run$1$$anonfun$apply$8.apply(ContextRDD.scala:192)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:196)
at is.hail.rvd.RVD$.getKeyInfo(RVD.scala:1299)
at is.hail.rvd.RVD$.makeCoercer(RVD.scala:1363)
at is.hail.io.vcf.MatrixVCFReader.coercer$lzycompute(LoadVCF.scala:1559)
at is.hail.io.vcf.MatrixVCFReader.coercer(LoadVCF.scala:1559)
at is.hail.io.vcf.MatrixVCFReader.apply(LoadVCF.scala:1588)
at is.hail.expr.ir.TableRead.execute(TableIR.scala:294)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:768)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:90)
at is.hail.expr.ir.CompileAndEvaluate$$anonfun$1.apply(CompileAndEvaluate.scala:33)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:24)
at is.hail.expr.ir.CompileAndEvaluate$.apply(CompileAndEvaluate.scala:33)
at is.hail.backend.Backend$$anonfun$execute$1.apply(Backend.scala:86)
at is.hail.backend.Backend$$anonfun$execute$1.apply(Backend.scala:86)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:8)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:7)
at is.hail.utils.package$.using(package.scala:596)
at is.hail.annotations.Region$.scoped(Region.scala:11)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:7)
at is.hail.backend.Backend.execute(Backend.scala:86)
at is.hail.backend.Backend.executeJSON(Backend.scala:92)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:335)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:795)
at org.apache.hadoop.io.Text.decode(Text.java:412)
at org.apache.hadoop.io.Text.decode(Text.java:389)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at org.apache.spark.SparkContext$$anonfun$textFile$1$$anonfun$apply$11.apply(SparkContext.scala:831)
at org.apache.spark.SparkContext$$anonfun$textFile$1$$anonfun$apply$11.apply(SparkContext.scala:831)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:445)
at is.hail.io.vcf.LoadVCF$$anonfun$parseLines$1$$anon$1.hasNext(LoadVCF.scala:1268)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:66)
at is.hail.rvd.RVDPartitionInfo$$anonfun$apply$1.apply(RVDPartitionInfo.scala:38)
at is.hail.utils.package$.using(package.scala:596)
at is.hail.rvd.RVDPartitionInfo$.apply(RVDPartitionInfo.scala:38)
at is.hail.rvd.RVD$$anonfun$39.apply(RVD.scala:1295)
at is.hail.rvd.RVD$$anonfun$39.apply(RVD.scala:1293)
at is.hail.sparkextras.ContextRDD$$anonfun$cmapPartitionsWithIndex$1$$anonfun$apply$32.apply(ContextRDD.scala:422)
at is.hail.sparkextras.ContextRDD$$anonfun$cmapPartitionsWithIndex$1$$anonfun$apply$32.apply(ContextRDD.scala:422)
at is.hail.sparkextras.ContextRDD$$anonfun$run$1$$anonfun$apply$8.apply(ContextRDD.scala:192)
at is.hail.sparkextras.ContextRDD$$anonfun$run$1$$anonfun$apply$8.apply(ContextRDD.scala:192)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)

Hail version: 0.2.18-08ec699f0fd4
Error summary: OutOfMemoryError: Java heap space

Did you install Hail/Spark with pip? If so, you're running Spark in local mode. By default, Spark uses very little memory; you can ask for more using an environment variable:

PYSPARK_SUBMIT_ARGS="--driver-memory 400G pyspark-shell"
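A minimal sketch of setting this from inside Python instead of the shell (assuming nothing has started the JVM yet in that session; 400G is just a figure that leaves some headroom below your 480GB of RAM):

import os

# Must be set before Hail/Spark launches the JVM.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 400G pyspark-shell'

import hail as hl
hl.init()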

I also think that 32,390 partitions (processing tasks, the unit of parallelism) is too many in this case, and I'm not totally sure how Spark created that many: it works out to ~15MB per chunk, and the smallest file-system default block size I've seen is 32MB.

To fix this, do the following at the top of your script:

import hail as hl
hl.init(min_block_size=128)  # minimum file split (partition) size of 128MB

This will also ease memory pressure.
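As a quick sanity check (a sketch; the path is a placeholder for your actual file), you can confirm the new partition count before writing:

import hail as hl
hl.init(min_block_size=128)  # minimum file split size of 128MB

# Placeholder path; substitute your VCF.
mt = hl.import_vcf('/path/to/your.vcf.bgz')
print(mt.n_partitions())  # should now be far fewer than 32,390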

I did everything you suggested.

The 32,390 partitions went down to 8,098, but I still get the same error:

Hail version: 0.2.18-08ec699f0fd4
Error summary: OutOfMemoryError: Java heap space

Can you share the Hail log file?