"Hail off-heap memory exceeded maximum threshold" error on large analysis job

Hello,

My Hail job is crashing with the error “is.hail.utils.HailException: Hail off-heap memory exceeded maximum threshold: limit 6.17 GiB, allocated 6.18 GiB.” I notice that the gcloud command hailctl issues passes a Spark executor environment variable (spark.executorEnv.HAIL_WORKER_OFF_HEAP_MEMORY_PER_CORE_MB=6323). What are the consequences of changing this value? The default assigned in the source code (src/main/is/hail/annotations/RegionPool.scala:20) is Long.MaxValue (9,223,372,036,854,775,807), which is effectively unlimited.
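In case the arithmetic is useful (I may be misreading how these values are derived): the 6.17 GiB limit in the error appears to be exactly the per-core value interpreted as MiB, and four executor cores at that value add up to the executor's full YARN allocation in the gcloud command below:

    6323 MiB / 1024 ≈ 6.17 GiB   (the limit reported in the error)
    4 cores × 6323 MB = 25,292 MB
                      = 10,117 MB (spark.executor.memory) + 15,175 MB (spark.executor.memoryOverhead)
                      = 25,292 MB (yarn.scheduler.maximum-allocation-mb)

So it looks as though hailctl sets the per-core off-heap limit to the executor's total YARN allocation divided by its core count, but please correct me if that's not how it works.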

Thank you,
Daniel Cotter

For reference:

  • I am getting the error on a Dataproc cluster on GCP running Debian 11, Hadoop 3.3.3, Spark 3.3.0, and Hail 0.2.112.
  • The autoscaling policy for the cluster allows up to 2500 n1-highmem-8 workers to be spawned with 1000-GB boot disks.
  • I am using the following hailctl command to create the cluster:
    hailctl dataproc start dcotter-std-gp-autoscaling-jupyter \
      --image-version=2.1.7-debian11 \
      --autoscaling-policy=mvp-data-release-2-100k-genomes-autoscaling-policy  \
      --master-machine-type=n1-highmem-8 \
      --worker-machine-type=n1-highmem-8 \
      --worker-boot-disk-size=1000 \
      --secondary-worker-type=non-preemptible \
      --preemptible-worker-boot-disk-size=1000 \
      --properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true,spark:spark.sql.shuffle.partitions=24240,spark:spark.default.parallelism=24240
    
  • This expands to the following gcloud command:
    gcloud dataproc clusters create dcotter-std-gp-autoscaling-jupyter \
        --image-version=2.1.2-debian11 \
        --properties=^|||^spark:spark.task.maxFailures=20|||spark:spark.driver.extraJavaOptions=-Xss4M|||spark:spark.executor.extraJavaOptions=-Xss4M|||spark:spark.speculation=true|||hdfs:dfs.replication=1|||dataproc:dataproc.logging.stackdriver.enable=true|||dataproc:dataproc.monitoring.stackdriver.enable=true|||spark:spark.sql.shuffle.partitions=24240|||spark:spark.default.parallelism=24240|||spark:spark.driver.memory=41g|||yarn:yarn.nodemanager.resource.memory-mb=50585|||yarn:yarn.scheduler.maximum-allocation-mb=25292|||spark:spark.executor.cores=4|||spark:spark.executor.memory=10117m|||spark:spark.executor.memoryOverhead=15175m|||spark:spark.memory.storageFraction=0.2|||spark:spark.executorEnv.HAIL_WORKER_OFF_HEAP_MEMORY_PER_CORE_MB=6323 \
        --initialization-actions=gs://hail-common/hailctl/dataproc/0.2.112/init_notebook.py \
        --metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.112/hail-0.2.112-py3-none-any.whl|||PKGS=aiohttp==3.8.4|aiohttp-session==2.12.0|aiosignal==1.3.1|async-timeout==4.0.2|asyncinit==0.2.4|asynctest==0.13.0|attrs==22.2.0|avro==1.11.1|azure-core==1.26.3|azure-identity==1.12.0|azure-storage-blob==12.14.1|bokeh==1.4.0|boto3==1.26.73|botocore==1.29.73|cachetools==5.3.0|certifi==2022.12.7|cffi==1.15.1|charset-normalizer==3.0.1|commonmark==0.9.1|cryptography==39.0.1|decorator==4.4.2|deprecated==1.2.13|dill==0.3.6|frozenlist==1.3.3|google-api-core==2.11.0|google-auth==2.14.1|google-cloud-core==2.3.2|google-cloud-storage==2.7.0|google-crc32c==1.5.0|google-resumable-media==2.4.1|googleapis-common-protos==1.58.0|humanize==1.1.0|hurry-filesize==0.9|idna==3.4|isodate==0.6.1|janus==1.0.0|jinja2==3.0.3|jmespath==1.0.1|markupsafe==2.1.2|msal==1.21.0|msal-extensions==1.0.0|msrest==0.7.1|multidict==6.0.4|nest-asyncio==1.5.6|numpy==1.21.6|oauthlib==3.2.2|orjson==3.8.6|packaging==23.0|pandas==1.3.5|parsimonious==0.8.1|pillow==9.4.0|plotly==5.10.0|portalocker==2.7.0|protobuf==3.20.2|py4j==0.10.9.5|pyasn1==0.4.8|pyasn1-modules==0.2.8|pycparser==2.21|pygments==2.14.0|pyjwt[crypto]==2.6.0|python-dateutil==2.8.2|python-json-logger==2.0.6|pytz==2022.7.1|pyyaml==6.0|requests==2.28.2|requests-oauthlib==1.3.1|rich==12.6.0|rsa==4.9|s3transfer==0.6.0|scipy==1.7.3|six==1.16.0|sortedcontainers==2.4.0|tabulate==0.9.0|tenacity==8.2.1|tornado==6.2|typing-extensions==4.5.0|urllib3==1.26.14|uvloop==0.17.0;sys_platform!="win32"|wrapt==1.14.1|yarl==1.8.2 \
        --master-machine-type=n1-highmem-8 \
        --master-boot-disk-size=100GB \
        --num-master-local-ssds=0 \
        --num-secondary-workers=0 \
        --num-worker-local-ssds=0 \
        --num-workers=2 \
        --secondary-worker-boot-disk-size=1000GB \
        --worker-boot-disk-size=1000GB \
        --worker-machine-type=n1-highmem-8 \
        --initialization-action-timeout=20m \
        --image-version=2.1.7-debian11 \
        --autoscaling-policy=mvp-data-release-2-100k-genomes-autoscaling-policy \
        --secondary-worker-type=non-preemptible
    
  • The stacktrace and error message are as follows:
    Traceback (most recent call last):
      File "/tmp/caefcabe45074f398f4e195732308206/burden-testing-wgs-100k-annotated-230220.py", line 173, in <module>
        burden_results = hl.linear_regression_rows(
      File "<decorator-gen-1710>", line 2, in linear_regression_rows
      File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 577, in wrapper
        return __original_func(*args_, **kwargs_)
      File "/opt/conda/default/lib/python3.10/site-packages/hail/methods/statgen.py", line 375, in linear_regression_rows
        return ht_result.persist()
      File "<decorator-gen-1106>", line 2, in persist
      File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 577, in wrapper
        return __original_func(*args_, **kwargs_)
      File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 2126, in persist
        return Env.backend().persist_table(self)
      File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/backend.py", line 172, in persist_table
        return t.checkpoint(tf.__enter__())
      File "<decorator-gen-1096>", line 2, in checkpoint
      File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 577, in wrapper
        return __original_func(*args_, **kwargs_)
      File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1345, in checkpoint
        self.write(output=output, overwrite=overwrite, stage_locally=stage_locally, _codec_spec=_codec_spec)
      File "<decorator-gen-1098>", line 2, in write
      File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 577, in wrapper
        return __original_func(*args_, **kwargs_)
      File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1391, in write
        Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
      File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 82, in execute
        raise e.maybe_user_error(ir) from None
      File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 76, in execute
        result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
      File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
      File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 35, in deco
        raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
    hail.utils.java.FatalError: HailException: Hail off-heap memory exceeded maximum threshold: limit 6.17 GiB, allocated 6.18 GiB
    Report: 6.2G allocated (6.8M blocks / 6.2G chunks), regions.size = 17, 0 current java objects, thread 282: Executor task launch worker for task 13345.19 in stage 16.0 (TID 124643)
    
    Java stack trace:
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 13345 in stage 16.0 failed 20 times, most recent failure: Lost task 13345.19 in stage 16.0 (TID 124643) (dcotter-std-gp-autoscaling-jupyter-sw-l15b.c.gbsc-gcp-project-mvp.internal executor 986): is.hail.utils.HailException: Hail off-heap memory exceeded maximum threshold: limit 6.17 GiB, allocated 6.18 GiB
    Report: 6.2G allocated (6.8M blocks / 6.2G chunks), regions.size = 17, 0 current java objects, thread 282: Executor task launch worker for task 13345.19 in stage 16.0 (TID 124643)
    	at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:17)
    	at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:17)
    	at is.hail.utils.package$.fatal(package.scala:78)
    	at is.hail.annotations.RegionPool.closeAndThrow(RegionPool.scala:58)
    	at is.hail.annotations.RegionPool.incrementAllocatedBytes(RegionPool.scala:73)
    	at is.hail.annotations.ChunkCache.newChunk(ChunkCache.scala:75)
    	at is.hail.annotations.ChunkCache.getChunk(ChunkCache.scala:114)
    	at is.hail.annotations.RegionPool.getChunk(RegionPool.scala:96)
    	at is.hail.annotations.RegionMemory.allocateBigChunk(RegionMemory.scala:62)
    	at is.hail.annotations.RegionMemory.allocate(RegionMemory.scala:96)
    	at is.hail.annotations.Region.allocate(Region.scala:332)
    	at __C10331stream.__m10627split_ToArray(Unknown Source)
    	at __C10331stream.__m10626begin_group_0(Unknown Source)
    	at __C10331stream.apply_region164_170(Unknown Source)
    	at __C10331stream.apply_region10_174(Unknown Source)
    	at __C10331stream.apply_region8_272(Unknown Source)
    	at __C10331stream.apply(Unknown Source)
    	at is.hail.expr.ir.CompileIterator$anon$2.step(Compile.scala:303)
    	at is.hail.expr.ir.CompileIterator$LongIteratorWrapper.hasNext(Compile.scala:156)
    	at scala.collection.Iterator$anon$10.hasNext(Iterator.scala:460)
    	at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:490)
    	at scala.collection.Iterator$anon$10.hasNext(Iterator.scala:460)
    	at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:490)
    	at is.hail.rvd.RVD$.$anonfun$getKeyInfo$2(RVD.scala:1257)
    	at is.hail.rvd.RVD$.$anonfun$getKeyInfo$2$adapted(RVD.scala:1256)
    	at is.hail.sparkextras.ContextRDD.$anonfun$crunJobWithIndex$1(ContextRDD.scala:242)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:136)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    	at java.base/java.lang.Thread.run(Thread.java:829)
    
    Driver stacktrace:
    	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2673)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2609)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2608)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2608)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
    	at scala.Option.foreach(Option.scala:407)
    	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2861)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2803)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2792)
    	at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2257)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2289)
    	at is.hail.sparkextras.ContextRDD.crunJobWithIndex(ContextRDD.scala:238)
    	at is.hail.rvd.RVD$.getKeyInfo(RVD.scala:1256)
    	at is.hail.rvd.RVD$.makeCoercer(RVD.scala:1331)
    	at is.hail.rvd.RVD$.coerce(RVD.scala:1287)
    	at is.hail.rvd.RVD.changeKey(RVD.scala:144)
    	at is.hail.rvd.RVD.changeKey(RVD.scala:137)
    	at is.hail.backend.spark.SparkBackend.lowerDistributedSort(SparkBackend.scala:741)
    	at is.hail.expr.ir.lowering.LowerAndExecuteShuffles$.$anonfun$apply$1(LowerAndExecuteShuffles.scala:72)
    	at is.hail.expr.ir.RewriteBottomUp$.$anonfun$apply$4(RewriteBottomUp.scala:26)
    	at is.hail.utils.StackSafe$More.advance(StackSafe.scala:60)
    	at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
    	at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
    	at is.hail.expr.ir.RewriteBottomUp$.apply(RewriteBottomUp.scala:36)
    	at is.hail.expr.ir.lowering.LowerAndExecuteShuffles$.apply(LowerAndExecuteShuffles.scala:20)
    	at is.hail.expr.ir.lowering.LowerAndExecuteShufflesPass.transform(LoweringPass.scala:157)
    	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
    	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
    	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
    	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
    	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
    	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
    	at is.hail.expr.ir.lowering.LowerAndExecuteShufflesPass.apply(LoweringPass.scala:151)
    	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:22)
    	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:20)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:20)
    	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:50)
    	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:454)
    	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:490)
    	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:75)
    	at is.hail.utils.package$.using(package.scala:635)
    	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:75)
    	at is.hail.utils.package$.using(package.scala:635)
    	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
    	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:63)
    	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:342)
    	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:487)
    	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
    	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:486)
    	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    	at py4j.Gateway.invoke(Gateway.java:282)
    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
    	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    	at java.base/java.lang.Thread.run(Thread.java:829)
    
    is.hail.utils.HailException: Hail off-heap memory exceeded maximum threshold: limit 6.17 GiB, allocated 6.18 GiB
    Report: 6.2G allocated (6.8M blocks / 6.2G chunks), regions.size = 17, 0 current java objects, thread 282: Executor task launch worker for task 13345.19 in stage 16.0 (TID 124643)
    	at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:17)
    	at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:17)
    	at is.hail.utils.package$.fatal(package.scala:78)
    	at is.hail.annotations.RegionPool.closeAndThrow(RegionPool.scala:58)
    	at is.hail.annotations.RegionPool.incrementAllocatedBytes(RegionPool.scala:73)
    	at is.hail.annotations.ChunkCache.newChunk(ChunkCache.scala:75)
    	at is.hail.annotations.ChunkCache.getChunk(ChunkCache.scala:114)
    	at is.hail.annotations.RegionPool.getChunk(RegionPool.scala:96)
    	at is.hail.annotations.RegionMemory.allocateBigChunk(RegionMemory.scala:62)
    	at is.hail.annotations.RegionMemory.allocate(RegionMemory.scala:96)
    	at is.hail.annotations.Region.allocate(Region.scala:332)
    	at __C10331stream.__m10627split_ToArray(Unknown Source)
    	at __C10331stream.__m10626begin_group_0(Unknown Source)
    	at __C10331stream.apply_region164_170(Unknown Source)
    	at __C10331stream.apply_region10_174(Unknown Source)
    	at __C10331stream.apply_region8_272(Unknown Source)
    	at __C10331stream.apply(Unknown Source)
    	at is.hail.expr.ir.CompileIterator$anon$2.step(Compile.scala:303)
    	at is.hail.expr.ir.CompileIterator$LongIteratorWrapper.hasNext(Compile.scala:156)
    	at scala.collection.Iterator$anon$10.hasNext(Iterator.scala:460)
    	at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:490)
    	at scala.collection.Iterator$anon$10.hasNext(Iterator.scala:460)
    	at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:490)
    	at is.hail.rvd.RVD$.$anonfun$getKeyInfo$2(RVD.scala:1257)
    	at is.hail.rvd.RVD$.$anonfun$getKeyInfo$2$adapted(RVD.scala:1256)
    	at is.hail.sparkextras.ContextRDD.$anonfun$crunJobWithIndex$1(ContextRDD.scala:242)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:136)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    	at java.base/java.lang.Thread.run(Thread.java:829)
    
    
    
    
    Hail version: 0.2.112-31ceff2fb5fd
    Error summary: HailException: Hail off-heap memory exceeded maximum threshold: limit 6.17 GiB, allocated 6.18 GiB
    Report: 6.2G allocated (6.8M blocks / 6.2G chunks), regions.size = 17, 0 current java objects, thread 282: Executor task launch worker for task 13345.19 in stage 16.0 (TID 124643)
    

By default, Hail explicitly partitions memory between the JVM heap and its own off-heap data structures. This environment variable sets the size of the off-heap portion per core. The explicit limit often produces a clearer failure (a handled exception rather than an OOM SIGKILL) for pipelines that weren’t going to fit in memory anyway. It’s possible that some pipelines will succeed without the explicit partitioning, since peak JVM memory use and peak off-heap memory use often don’t happen at the same time. You can turn off explicit memory partitioning at cluster creation with --no-off-heap-memory.
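For example, a sketch based on the hailctl command you posted above (the only change is the added --no-off-heap-memory flag; everything else is copied from your original invocation):

    hailctl dataproc start dcotter-std-gp-autoscaling-jupyter \
      --no-off-heap-memory \
      --image-version=2.1.7-debian11 \
      --autoscaling-policy=mvp-data-release-2-100k-genomes-autoscaling-policy \
      --master-machine-type=n1-highmem-8 \
      --worker-machine-type=n1-highmem-8 \
      --worker-boot-disk-size=1000 \
      --secondary-worker-type=non-preemptible \
      --preemptible-worker-boot-disk-size=1000 \
      --properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true,spark:spark.sql.shuffle.partitions=24240,spark:spark.default.parallelism=24240

With that flag, hailctl should no longer set HAIL_WORKER_OFF_HEAP_MEMORY_PER_CORE_MB, so Hail’s off-heap allocations aren’t capped per task and you fall back on the usual Spark/YARN memory limits.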

It’s possible there’s a memory leak in Hail, or the pipeline may simply be very memory-intensive. Can you share the script you’re running?