Hello,
My Hail job is crashing with the error message “is.hail.utils.HailException: Hail off-heap memory exceeded maximum threshold: limit 6.17 GiB, allocated 6.18 GiB.” I notice that the `gcloud` command that `hailctl` issues passes a Spark environment property, `HAIL_WORKER_OFF_HEAP_MEMORY_PER_CORE_MB=6323`. What are the consequences of changing this value? The default value assigned in the source code (src/main/is/hail/annotations/RegionPool.scala:20) is `Long.MaxValue`, which is 9,223,372,036,854,775,807 MB.
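If it is safe to raise the limit, my plan would be to pass a larger value myself when creating the cluster, along the lines of the sketch below. This assumes a user-supplied `--properties` entry takes precedence over the value `hailctl` computes, and the 12000 is only a placeholder, not a recommendation:

```
# Hypothetical override: bump the per-core off-heap limit at cluster-creation time.
# Assumes a user-supplied property wins over the value hailctl computes;
# 12000 MB per core is only an illustrative placeholder.
hailctl dataproc start dcotter-std-gp-autoscaling-jupyter \
    --worker-machine-type=n1-highmem-8 \
    --properties=spark:spark.executorEnv.HAIL_WORKER_OFF_HEAP_MEMORY_PER_CORE_MB=12000
```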
Thank you,
Daniel Cotter
For reference:
- I am getting the error on a Dataproc cluster on GCP running Debian 11, Hadoop 3.3.3, Spark 3.3.0, and Hail 0.2.112.
- The autoscaling policy for the cluster allows up to 2500 n1-highmem-8 workers to be spawned with 1000-GB boot disks.
- I am using the following `hailctl` command to create the cluster:

```
hailctl dataproc start dcotter-std-gp-autoscaling-jupyter \
    --image-version=2.1.7-debian11 \
    --autoscaling-policy=mvp-data-release-2-100k-genomes-autoscaling-policy \
    --master-machine-type=n1-highmem-8 \
    --worker-machine-type=n1-highmem-8 \
    --worker-boot-disk-size=1000 \
    --secondary-worker-type=non-preemptible \
    --preemptible-worker-boot-disk-size=1000 \
    --properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true,spark:spark.sql.shuffle.partitions=24240,spark:spark.default.parallelism=24240
```
- This results in the following `gcloud` command (see the note after the stack trace for where the 6323 value appears to come from):

```
gcloud dataproc clusters create dcotter-std-gp-autoscaling-jupyter \
    --image-version=2.1.2-debian11 \
    --properties=^|||^spark:spark.task.maxFailures=20|||spark:spark.driver.extraJavaOptions=-Xss4M|||spark:spark.executor.extraJavaOptions=-Xss4M|||spark:spark.speculation=true|||hdfs:dfs.replication=1|||dataproc:dataproc.logging.stackdriver.enable=true|||dataproc:dataproc.monitoring.stackdriver.enable=true|||spark:spark.sql.shuffle.partitions=24240|||spark:spark.default.parallelism=24240|||spark:spark.driver.memory=41g|||yarn:yarn.nodemanager.resource.memory-mb=50585|||yarn:yarn.scheduler.maximum-allocation-mb=25292|||spark:spark.executor.cores=4|||spark:spark.executor.memory=10117m|||spark:spark.executor.memoryOverhead=15175m|||spark:spark.memory.storageFraction=0.2|||spark:spark.executorEnv.HAIL_WORKER_OFF_HEAP_MEMORY_PER_CORE_MB=6323 \
    --initialization-actions=gs://hail-common/hailctl/dataproc/0.2.112/init_notebook.py \
    --metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.112/hail-0.2.112-py3-none-any.whl|||PKGS=aiohttp==3.8.4|aiohttp-session==2.12.0|aiosignal==1.3.1|async-timeout==4.0.2|asyncinit==0.2.4|asynctest==0.13.0|attrs==22.2.0|avro==1.11.1|azure-core==1.26.3|azure-identity==1.12.0|azure-storage-blob==12.14.1|bokeh==1.4.0|boto3==1.26.73|botocore==1.29.73|cachetools==5.3.0|certifi==2022.12.7|cffi==1.15.1|charset-normalizer==3.0.1|commonmark==0.9.1|cryptography==39.0.1|decorator==4.4.2|deprecated==1.2.13|dill==0.3.6|frozenlist==1.3.3|google-api-core==2.11.0|google-auth==2.14.1|google-cloud-core==2.3.2|google-cloud-storage==2.7.0|google-crc32c==1.5.0|google-resumable-media==2.4.1|googleapis-common-protos==1.58.0|humanize==1.1.0|hurry-filesize==0.9|idna==3.4|isodate==0.6.1|janus==1.0.0|jinja2==3.0.3|jmespath==1.0.1|markupsafe==2.1.2|msal==1.21.0|msal-extensions==1.0.0|msrest==0.7.1|multidict==6.0.4|nest-asyncio==1.5.6|numpy==1.21.6|oauthlib==3.2.2|orjson==3.8.6|packaging==23.0|pandas==1.3.5|parsimonious==0.8.1|pillow==9.4.0|plotly==5.10.0|portalocker==2.7.0|protobuf==3.20.2|py4j==0.10.9.5|pyasn1==0.4.8|pyasn1-modules==0.2.8|pycparser==2.21|pygments==2.14.0|pyjwt[crypto]==2.6.0|python-dateutil==2.8.2|python-json-logger==2.0.6|pytz==2022.7.1|pyyaml==6.0|requests==2.28.2|requests-oauthlib==1.3.1|rich==12.6.0|rsa==4.9|s3transfer==0.6.0|scipy==1.7.3|six==1.16.0|sortedcontainers==2.4.0|tabulate==0.9.0|tenacity==8.2.1|tornado==6.2|typing-extensions==4.5.0|urllib3==1.26.14|uvloop==0.17.0;sys_platform!="win32"|wrapt==1.14.1|yarl==1.8.2 \
    --master-machine-type=n1-highmem-8 \
    --master-boot-disk-size=100GB \
    --num-master-local-ssds=0 \
    --num-secondary-workers=0 \
    --num-worker-local-ssds=0 \
    --num-workers=2 \
    --secondary-worker-boot-disk-size=1000GB \
    --worker-boot-disk-size=1000GB \
    --worker-machine-type=n1-highmem-8 \
    --initialization-action-timeout=20m \
    --image-version=2.1.7-debian11 \
    --autoscaling-policy=mvp-data-release-2-100k-genomes-autoscaling-policy \
    --secondary-worker-type=non-preemptible
```
- The stack trace and error message are as follows:

```
Traceback (most recent call last):
  File "/tmp/caefcabe45074f398f4e195732308206/burden-testing-wgs-100k-annotated-230220.py", line 173, in <module>
    burden_results = hl.linear_regression_rows(
  File "<decorator-gen-1710>", line 2, in linear_regression_rows
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/methods/statgen.py", line 375, in linear_regression_rows
    return ht_result.persist()
  File "<decorator-gen-1106>", line 2, in persist
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 2126, in persist
    return Env.backend().persist_table(self)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/backend.py", line 172, in persist_table
    return t.checkpoint(tf.__enter__())
  File "<decorator-gen-1096>", line 2, in checkpoint
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1345, in checkpoint
    self.write(output=output, overwrite=overwrite, stage_locally=stage_locally, _codec_spec=_codec_spec)
  File "<decorator-gen-1098>", line 2, in write
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1391, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 82, in execute
    raise e.maybe_user_error(ir) from None
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 76, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 35, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: HailException: Hail off-heap memory exceeded maximum threshold: limit 6.17 GiB, allocated 6.18 GiB
Report: 6.2G allocated (6.8M blocks / 6.2G chunks), regions.size = 17, 0 current java objects, thread 282: Executor task launch worker for task 13345.19 in stage 16.0 (TID 124643)

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 13345 in stage 16.0 failed 20 times, most recent failure: Lost task 13345.19 in stage 16.0 (TID 124643) (dcotter-std-gp-autoscaling-jupyter-sw-l15b.c.gbsc-gcp-project-mvp.internal executor 986): is.hail.utils.HailException: Hail off-heap memory exceeded maximum threshold: limit 6.17 GiB, allocated 6.18 GiB
Report: 6.2G allocated (6.8M blocks / 6.2G chunks), regions.size = 17, 0 current java objects, thread 282: Executor task launch worker for task 13345.19 in stage 16.0 (TID 124643)
    at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:17)
    at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:17)
    at is.hail.utils.package$.fatal(package.scala:78)
    at is.hail.annotations.RegionPool.closeAndThrow(RegionPool.scala:58)
    at is.hail.annotations.RegionPool.incrementAllocatedBytes(RegionPool.scala:73)
    at is.hail.annotations.ChunkCache.newChunk(ChunkCache.scala:75)
    at is.hail.annotations.ChunkCache.getChunk(ChunkCache.scala:114)
    at is.hail.annotations.RegionPool.getChunk(RegionPool.scala:96)
    at is.hail.annotations.RegionMemory.allocateBigChunk(RegionMemory.scala:62)
    at is.hail.annotations.RegionMemory.allocate(RegionMemory.scala:96)
    at is.hail.annotations.Region.allocate(Region.scala:332)
    at __C10331stream.__m10627split_ToArray(Unknown Source)
    at __C10331stream.__m10626begin_group_0(Unknown Source)
    at __C10331stream.apply_region164_170(Unknown Source)
    at __C10331stream.apply_region10_174(Unknown Source)
    at __C10331stream.apply_region8_272(Unknown Source)
    at __C10331stream.apply(Unknown Source)
    at is.hail.expr.ir.CompileIterator$anon$2.step(Compile.scala:303)
    at is.hail.expr.ir.CompileIterator$LongIteratorWrapper.hasNext(Compile.scala:156)
    at scala.collection.Iterator$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:490)
    at scala.collection.Iterator$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:490)
    at is.hail.rvd.RVD$.$anonfun$getKeyInfo$2(RVD.scala:1257)
    at is.hail.rvd.RVD$.$anonfun$getKeyInfo$2$adapted(RVD.scala:1256)
    at is.hail.sparkextras.ContextRDD.$anonfun$crunJobWithIndex$1(ContextRDD.scala:242)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2673)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2609)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2608)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2608)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2861)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2803)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2792)
    at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2257)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2289)
    at is.hail.sparkextras.ContextRDD.crunJobWithIndex(ContextRDD.scala:238)
    at is.hail.rvd.RVD$.getKeyInfo(RVD.scala:1256)
    at is.hail.rvd.RVD$.makeCoercer(RVD.scala:1331)
    at is.hail.rvd.RVD$.coerce(RVD.scala:1287)
    at is.hail.rvd.RVD.changeKey(RVD.scala:144)
    at is.hail.rvd.RVD.changeKey(RVD.scala:137)
    at is.hail.backend.spark.SparkBackend.lowerDistributedSort(SparkBackend.scala:741)
    at is.hail.expr.ir.lowering.LowerAndExecuteShuffles$.$anonfun$apply$1(LowerAndExecuteShuffles.scala:72)
    at is.hail.expr.ir.RewriteBottomUp$.$anonfun$apply$4(RewriteBottomUp.scala:26)
    at is.hail.utils.StackSafe$More.advance(StackSafe.scala:60)
    at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
    at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
    at is.hail.expr.ir.RewriteBottomUp$.apply(RewriteBottomUp.scala:36)
    at is.hail.expr.ir.lowering.LowerAndExecuteShuffles$.apply(LowerAndExecuteShuffles.scala:20)
    at is.hail.expr.ir.lowering.LowerAndExecuteShufflesPass.transform(LoweringPass.scala:157)
    at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
    at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
    at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
    at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
    at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
    at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
    at is.hail.expr.ir.lowering.LowerAndExecuteShufflesPass.apply(LoweringPass.scala:151)
    at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:22)
    at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:20)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:20)
    at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:50)
    at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:454)
    at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:490)
    at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:75)
    at is.hail.utils.package$.using(package.scala:635)
    at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:75)
    at is.hail.utils.package$.using(package.scala:635)
    at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
    at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:63)
    at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:342)
    at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:487)
    at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
    at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:486)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)

is.hail.utils.HailException: Hail off-heap memory exceeded maximum threshold: limit 6.17 GiB, allocated 6.18 GiB
Report: 6.2G allocated (6.8M blocks / 6.2G chunks), regions.size = 17, 0 current java objects, thread 282: Executor task launch worker for task 13345.19 in stage 16.0 (TID 124643)
    at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:17)
    at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:17)
    at is.hail.utils.package$.fatal(package.scala:78)
    at is.hail.annotations.RegionPool.closeAndThrow(RegionPool.scala:58)
    at is.hail.annotations.RegionPool.incrementAllocatedBytes(RegionPool.scala:73)
    at is.hail.annotations.ChunkCache.newChunk(ChunkCache.scala:75)
    at is.hail.annotations.ChunkCache.getChunk(ChunkCache.scala:114)
    at is.hail.annotations.RegionPool.getChunk(RegionPool.scala:96)
    at is.hail.annotations.RegionMemory.allocateBigChunk(RegionMemory.scala:62)
    at is.hail.annotations.RegionMemory.allocate(RegionMemory.scala:96)
    at is.hail.annotations.Region.allocate(Region.scala:332)
    at __C10331stream.__m10627split_ToArray(Unknown Source)
    at __C10331stream.__m10626begin_group_0(Unknown Source)
    at __C10331stream.apply_region164_170(Unknown Source)
    at __C10331stream.apply_region10_174(Unknown Source)
    at __C10331stream.apply_region8_272(Unknown Source)
    at __C10331stream.apply(Unknown Source)
    at is.hail.expr.ir.CompileIterator$anon$2.step(Compile.scala:303)
    at is.hail.expr.ir.CompileIterator$LongIteratorWrapper.hasNext(Compile.scala:156)
    at scala.collection.Iterator$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:490)
    at scala.collection.Iterator$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:490)
    at is.hail.rvd.RVD$.$anonfun$getKeyInfo$2(RVD.scala:1257)
    at is.hail.rvd.RVD$.$anonfun$getKeyInfo$2$adapted(RVD.scala:1256)
    at is.hail.sparkextras.ContextRDD.$anonfun$crunJobWithIndex$1(ContextRDD.scala:242)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

Hail version: 0.2.112-31ceff2fb5fd
Error summary: HailException: Hail off-heap memory exceeded maximum threshold: limit 6.17 GiB, allocated 6.18 GiB
Report: 6.2G allocated (6.8M blocks / 6.2G chunks), regions.size = 17, 0 current java objects, thread 282: Executor task launch worker for task 13345.19 in stage 16.0 (TID 124643)
```
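One more observation, in case it helps: judging from the generated properties above, the 6323 figure looks like it is simply the node's YARN memory divided evenly across the eight cores of an n1-highmem-8 worker. This is only my back-of-the-envelope reading of the numbers in the `gcloud` command, not something I have confirmed in the `hailctl` source:

```
# My reading of where 6323 comes from (unverified assumption):
# yarn.nodemanager.resource.memory-mb=50585 on an 8-vCPU n1-highmem-8 worker.
$ echo $(( 50585 / 8 ))   # off-heap MB per core
6323
$ echo $(( 6323 * 4 ))    # matches yarn.scheduler.maximum-allocation-mb for a 4-core executor
25292
```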