Spark executor heartbeat timeout during hl.king()

Cross-posted from GitHub: Spark executor heartbeat timeout during hl.king() · Issue #12290 · hail-is/hail

Our team is currently trying to run kinship analysis with hl.king() on just under 110k samples. We have run this successfully in the past on 10k samples using a Google Cloud Dataproc cluster with the following configuration.

hailctl dataproc start cluster --vep GRCh38 \
    --requester-pays-allow-annotation-db \
    --packages gnomad \
    --requester-pays-allow-buckets gnomad-public-requester-pays \
    --master-machine-type=n1-highmem-8 \
    --worker-machine-type=n1-highmem-8 \
    --num-workers=300 \
    --num-secondary-workers=0 \
    --worker-boot-disk-size=1000 \
    --properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true
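
For reference, the kinship step itself is essentially the call below (a minimal sketch; the dataset and output paths are placeholders, not our actual ones):

import hail as hl

hl.init(default_reference='GRCh38')

# Placeholder path; the real MatrixTable has just under 110k samples.
mt = hl.read_matrix_table('gs://our-bucket/dataset.mt')

# hl.king returns a MatrixTable keyed by sample pairs, with the
# KING-robust kinship estimate in the 'phi' entry field.
kinship = hl.king(mt.GT)

kinship.entries().write('gs://our-bucket/king_kinship.ht', overwrite=True)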

We are currently receiving the following Spark error when using this cluster for our larger dataset.

[Stage 10:=====>                                               (69 + 656) / 729]
    raise err
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 31, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 582 in stage 10.0 failed 20 times, most recent failure: Lost task 582.19 in stage 10.0 (TID 461381) (cluster-w-144.c.project-.internal executor 3568): ExecutorLostFailure (executor 3568 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 128936 ms
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 582 in stage 10.0 failed 20 times, most recent failure: Lost task 582.19 in stage 10.0 (TID 461381) (cluster-w-144.c.gbsc-project.internal executor 3568): ExecutorLostFailure (executor 3568 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 128936 ms
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2304)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2253)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2252)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2252)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2491)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2433)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2422)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:902)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2204)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2225)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2244)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2269)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:176)
	at is.hail.utils.richUtils.RichContextRDD.writePartitions(RichContextRDD.scala:105)
	at is.hail.utils.richUtils.RichRDD$.writePartitions$extension(RichRDD.scala:209)
	at is.hail.linalg.BlockMatrix.write(BlockMatrix.scala:871)
	at is.hail.expr.ir.BlockMatrixNativeWriter.apply(BlockMatrixWriter.scala:46)
	at is.hail.expr.ir.BlockMatrixNativeWriter.apply(BlockMatrixWriter.scala:39)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:858)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:57)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:416)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:452)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:70)
	at is.hail.utils.package$.using(package.scala:646)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:70)
	at is.hail.utils.package$.using(package.scala:646)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:59)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:310)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:449)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:448)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)



Hail version: 0.2.98-f8833c1ae16b
Error summary: SparkException: Job aborted due to stage failure: Task 582 in stage 10.0 failed 20 times, most recent failure: Lost task 582.19 in stage 10.0 (TID 461381) (cluster-w-144.c.project.internal executor 3568): ExecutorLostFailure (executor 3568 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 128936 ms
Driver stacktrace:

We also tried a similar cluster configuration with dynamic scaling and received the same error. Do you have a recommended cluster configuration for running hl.king() on this many samples?
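
We are aware that Spark's executor heartbeat and network timeouts can be raised when initializing Hail, e.g. via spark_conf in hl.init as sketched below (the values are illustrative only), but we are not sure whether tuning these is the right approach versus changing the cluster shape:

import hail as hl

# Illustrative values only. Spark's defaults are 10s for the executor
# heartbeat interval and 120s for the network timeout; the heartbeat
# interval should stay well below spark.network.timeout.
hl.init(
    default_reference='GRCh38',
    spark_conf={
        'spark.executor.heartbeatInterval': '60s',
        'spark.network.timeout': '600s',
    },
)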

Update: We have resolved this issue. An explanation has been documented in the GitHub issue linked above.
