Still struggling with OutOfMemoryError: Java heap space

Hi,

Hail is still crashing when dealing with the UK Biobank.
I used the following in my .bashrc, as you suggested:

export PYSPARK_SUBMIT_ARGS="\
  --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
  --conf spark.driver.extraClassPath=\"$HAIL_HOME/build/libs/hail-all-spark.jar\" \
  --conf spark.executor.extraClassPath=/usr/local/hail/build/libs/hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
  --executor-memory 7g \
  --driver-memory 8g pyspark-shell"

This is what I’m trying to do, a simple exclusion of variants with INFO below 0.4, and patients with a missing phenotype or covariate:

autosomes = hl.import_bgen("/mnt/UKBBv3_bgens/ukb_imp_chr*_v3.bgen", entry_fields=['GP'], sample_file="/mnt/output/sb/UKBB/phenotypes/ukb9922_imp_chr1_v3_s487395.sample", n_partitions=18000)
info1 = (hl.import_table('/mnt/output/sb/UKBB/phenotypes/autosomes.info', impute=True).key_by('rsid'))
autosomes = autosomes.annotate_rows(**info1[autosomes.rsid])
autosomes = autosomes.filter_rows(autosomes.INFO>=0.4) # Filter out SNPs with info below 0.4
table1 = (hl.import_table('/mnt/output/sb/UKBB/phenotypes/UKBB_autosomes_phenotypes.4JOINT.txt', impute=False, delimiter='\s+', types={'ID_2': hl.tstr, 'sex' : hl.tint32, 'age' : hl.tfloat64, 'height' : hl.tfloat64, 'weight' : hl.tfloat64, 'pheno' : hl.tint32}).key_by('ID_2'))
table1 = table1.annotate(is_female = sex2bool(table1.sex))
table1 = table1.annotate(pheno_case = case(table1.pheno))
autosomes = autosomes.annotate_cols(**table1[autosomes.s])
autosomes = autosomes.filter_cols(~hl.is_missing(autosomes.weight)) 
autosomes = autosomes.filter_cols(~hl.is_missing(autosomes.age)) 
autosomes = autosomes.filter_cols(~hl.is_missing(autosomes.pheno)) 
# Save matrix
autosomes.write('/mnt/output/sb/UKBB/genotypes/UKBB_autosomes.mt')

I could get 10GB of RAM per core, but that is a lot of hassle (deleting my current VMs and reconstructing everything), and that would be the absolute maximum I could go to given my available resources. Last time we tried something similar we pushed it to 12GB and it made no difference. Of course I could sacrifice one of my VMs completely and use the memory to push it to 20GB on a single VM, but that would double my analysis time, so I’ll do that only if you don’t see any another solution (and hoping that would be enough!).

Would you have any other suggestion?

Complete stack trace:

Using Spark’s default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to “WARN”.
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage 5:> (0 + 95) / 96][Stage 6:> (0 + 1) / 18011]Running on Apache Spark version 2.2.0
SparkUI available at http://10.5.1.118:4040
Welcome to
__ __ <>__
/ // /__ __/ /
/ __ / _ `/ / /
/
/ //_,/// version 0.2.10-91149b50a53c
LOGGING: writing to /mnt/output/sb/joint/hail-20190303-1006-0.2.10-91149b50a53c.log
2019-03-03 10:06:25 Hail: INFO: Number of BGEN files parsed: 22
2019-03-03 10:06:25 Hail: INFO: Number of samples in BGEN files: 487409
2019-03-03 10:06:25 Hail: INFO: Number of variants across all BGEN files: 93095623
2019-03-03 10:08:01 Hail: INFO: Number of BGEN files parsed: 1
2019-03-03 10:08:01 Hail: INFO: Number of samples in BGEN files: 486757
2019-03-03 10:08:01 Hail: INFO: Number of variants across all BGEN files: 3917799
2019-03-03 10:08:09 Hail: INFO: Number of BGEN files parsed: 1
2019-03-03 10:08:09 Hail: INFO: Number of samples in BGEN files: 486443
2019-03-03 10:08:09 Hail: INFO: Number of variants across all BGEN files: 45906
2019-03-03 10:08:10 Hail: INFO: Reading table to impute column types
2019-03-03 10:08:16 Hail: INFO: Finished type imputation
Loading column ‘rsid’ as type ‘str’ (imputed)
Loading column ‘INFO’ as type ‘float64’ (imputed)
2019-03-03 10:08:16 Hail: INFO: Reading table to impute column types
2019-03-03 10:08:17 Hail: INFO: Finished type imputation
Loading column ‘rsid’ as type ‘str’ (imputed)
Loading column ‘INFO’ as type ‘float64’ (imputed)
2019-03-03 10:08:17 Hail: INFO: Reading table to impute column types
2019-03-03 10:08:17 Hail: INFO: Finished type imputation
Loading column ‘rsid’ as type ‘str’ (imputed)
Loading column ‘INFO’ as type ‘float64’ (imputed)
2019-03-03 10:08:17 Hail: INFO: Reading table with no type imputation
Loading column ‘ID_2’ as type ‘str’ (user-specified)
Loading column ‘sex’ as type ‘int32’ (user-specified)
Loading column ‘age’ as type ‘float64’ (user-specified)
Loading column ‘height’ as type ‘float64’ (user-specified)
Loading column ‘weight’ as type ‘float64’ (user-specified)
Loading column ‘pheno’ as type ‘int32’ (user-specified)

2019-03-03 10:08:17 Hail: INFO: Reading table with no type imputation
Loading column ‘ID_2’ as type ‘str’ (user-specified)
Loading column ‘sex’ as type ‘int32’ (user-specified)
Loading column ‘age’ as type ‘float64’ (user-specified)
Loading column ‘height’ as type ‘float64’ (user-specified)
Loading column ‘weight’ as type ‘float64’ (user-specified)
Loading column ‘pheno’ as type ‘int32’ (user-specified)

2019-03-03 10:08:17 Hail: INFO: Reading table with no type imputation
Loading column ‘ID_2’ as type ‘str’ (user-specified)
Loading column ‘sex’ as type ‘int32’ (user-specified)
Loading column ‘age’ as type ‘float64’ (user-specified)
Loading column ‘height’ as type ‘float64’ (user-specified)
Loading column ‘weight’ as type ‘float64’ (user-specified)
Loading column ‘pheno’ as type ‘int32’ (user-specified)

2019-03-04 05:38:09 Hail: INFO: Ordering unsorted dataset with network shuffle
2019-03-04 05:38:21 Hail: INFO: Ordering unsorted dataset with network shuffle
Traceback (most recent call last):
File “./UKBB.py”, line 79, in
autosomes.write(‘/mnt/output/sb/UKBB/genotypes/UKBB_autosomes.mt’)
File “”, line 2, in write
File “/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py”, line 561, in wrapper
File “/usr/local/hail/build/distributions/hail-python.zip/hail/matrixtable.py”, line 2170, in write
File “/usr/local/hail/build/distributions/hail-python.zip/hail/backend/backend.py”, line 94, in execute
File “/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py”, line 1133, in call
File “/usr/local/hail/build/distributions/hail-python.zip/hail/utils/java.py”, line 227, in deco
hail.utils.java.FatalError: OutOfMemoryError: Java heap space

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 5.0 failed 4 times, most recent failure: Lost task 16.3 in stage 5.0 (TID 18512, 10.4.1.209, executor 4): java.lang.OutOfMemoryError: Java heap space
at is.hail.io.BlockingOutputBuffer.(RowStore.scala:325)
at is.hail.io.BlockingBufferSpec.buildOutputBuffer(RowStore.scala:49)
at is.hail.io.PackCodecSpec$$anonfun$buildEncoder$2.apply(RowStore.scala:184)
at is.hail.io.PackCodecSpec$$anonfun$buildEncoder$2.apply(RowStore.scala:184)
at is.hail.annotations.RegionValue$$anonfun$toBytes$2.apply(RegionValue.scala:70)
at is.hail.annotations.RegionValue$$anonfun$toBytes$2.apply(RegionValue.scala:69)
at is.hail.utils.package$.using(package.scala:587)
at is.hail.annotations.RegionValue.toBytes(RegionValue.scala:69)
at is.hail.rvd.RVD$$anonfun$keyedEncodedRDD$1$$anonfun$apply$3.apply(RVD.scala:93)
at is.hail.rvd.RVD$$anonfun$keyedEncodedRDD$1$$anonfun$apply$3.apply(RVD.scala:91)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:157)
at is.hail.rvd.RVD$.getKeyInfo(RVD.scala:1065)
at is.hail.rvd.RVD$.makeCoercer(RVD.scala:1129)
at is.hail.rvd.RVD$.coerce(RVD.scala:1087)
at is.hail.rvd.RVD$.coerce(RVD.scala:1073)
at is.hail.expr.ir.TableKeyByAndAggregate.execute(TableIR.scala:1236)
at is.hail.expr.ir.TableLeftJoinRightDistinct.execute(TableIR.scala:730)
at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:756)
at is.hail.expr.ir.TableFilter.execute(TableIR.scala:298)
at is.hail.expr.ir.TableMapGlobals.execute(TableIR.scala:898)
at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:756)
at is.hail.expr.ir.CastTableToMatrix.execute(MatrixIR.scala:1667)
at is.hail.expr.ir.MatrixMapCols.execute(MatrixIR.scala:1138)
at is.hail.expr.ir.CastMatrixToTable.execute(TableIR.scala:1452)
at is.hail.expr.ir.TableMapGlobals.execute(TableIR.scala:898)
at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:756)
at is.hail.expr.ir.TableMapGlobals.execute(TableIR.scala:898)
at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:756)
at is.hail.expr.ir.CastTableToMatrix.execute(MatrixIR.scala:1667)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:767)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:93)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:63)
at is.hail.expr.ir.Interpret$.interpretJSON(Interpret.scala:22)
at is.hail.expr.ir.Interpret.interpretJSON(Interpret.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

java.lang.OutOfMemoryError: Java heap space
at is.hail.io.BlockingOutputBuffer.(RowStore.scala:325)
at is.hail.io.BlockingBufferSpec.buildOutputBuffer(RowStore.scala:49)
at is.hail.io.PackCodecSpec$$anonfun$buildEncoder$2.apply(RowStore.scala:184)
at is.hail.io.PackCodecSpec$$anonfun$buildEncoder$2.apply(RowStore.scala:184)
at is.hail.annotations.RegionValue$$anonfun$toBytes$2.apply(RegionValue.scala:70)
at is.hail.annotations.RegionValue$$anonfun$toBytes$2.apply(RegionValue.scala:69)
at is.hail.utils.package$.using(package.scala:587)
at is.hail.annotations.RegionValue.toBytes(RegionValue.scala:69)
at is.hail.rvd.RVD$$anonfun$keyedEncodedRDD$1$$anonfun$apply$3.apply(RVD.scala:93)
at is.hail.rvd.RVD$$anonfun$keyedEncodedRDD$1$$anonfun$apply$3.apply(RVD.scala:91)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.10-91149b50a53c
Error summary: OutOfMemoryError: Java heap space