Hl.maximal_independent_set - job 'cancelled because SparkContext was shut down'

Yes, but I’ve also seen this issue appear in docker runtimes.

I should say that I have gotten the pc_relate function to work on a much smaller dataset (n=3,000 vs. n=500,000), so it seems to be a sample-size-specific problem

Sam

That’s a pretty substantial increase in complexity. pc_relate scales as n_samples * n_samples * n_variants, so with the same number of variants that’s roughly a (500,000 / 3,000)^2 ≈ 27,000-fold increase in asymptotic complexity. It’s possible there are data structures on the leader (driver) that become too large at that dataset size. I’d expect setting the driver memory really large to avoid that issue.
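For what it’s worth, here’s a minimal sketch of how one might raise the driver memory at initialization (assuming a local-mode Spark setup; the values are placeholders, not tuned recommendations):

import hail as hl

# Request a large driver JVM heap before the SparkContext is created.
# In local mode every task runs inside the driver process, so this is the limit that matters.
hl.init(spark_conf={
    'spark.driver.memory': '100g',       # placeholder; size this to your machine
    'spark.driver.maxResultSize': '0',   # 0 = don't cap results collected back to the driver
})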

How many variants are you using?

200,000 variants; I can try to reduce that a bit.

I’d try to go down an order of magnitude there. 200k common variants capture more relatedness information than you need for this.
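As a rough sketch of what that thinning could look like (assuming the genotype calls are in mt.GT; the MAF cutoff and sampling fraction are illustrative only):

import hail as hl

# mt is assumed to be the already-loaded MatrixTable of ~200k variants.
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.05)  # keep common variants (illustrative cutoff)
mt = mt.sample_rows(0.1, seed=42)                # drop roughly an order of magnitude of rows
rel = hl.pc_relate(mt.GT, min_individual_maf=0.05, k=10, statistics='kin')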

Hi,

I tried running with 93k variants but still no joy.

LOGGING: writing to /mnt/grid/janowitz/home/skleeman/ukbiobank/cancergwas/relatedness/hail-20210202-1621-0.2.61-291a63b97bd9.log
2021-02-02 16:22:25 Hail: INFO: hwe_normalized_pca: running PCA using 93511 variants.
2021-02-02 16:23:23 Hail: INFO: pca: running PCA with 10 components...
2021-02-02 19:20:17 Hail: INFO: Wrote all 2760 blocks of 93511 x 488377 matrix with block size 4096.
2021-02-02 19:35:42 Hail: INFO: wrote matrix with 11 rows and 93511 columns as 23 blocks of size 4096 to /mnt/grid/janowitz/rdata_norepl/tmp/pcrelate-write-read-FwxtMIQglFbyGXpk5IJj3o.bm
2021-02-02 20:04:44 Hail: INFO: wrote matrix with 93511 rows and 488377 columns as 2760 blocks of size 4096 to /mnt/grid/janowitz/rdata_norepl/tmp/pcrelate-write-read-uTVvarxvTeN2pZEd5Q4cTy.bm
Traceback (most recent call last):
  File "relate.py", line 19, in <module>
    relatedness_ht.write("/mnt/grid/ukbiobank/data/Application58510/skleeman/relatedness_ukb.ht", overwrite=True)
  File "<decorator-gen-1095>", line 2, in write
  File "/grid/wsbs/home_norepl/skleeman/hail/hail/python/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/grid/wsbs/home_norepl/skleeman/hail/hail/python/hail/table.py", line 1271, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/grid/wsbs/home_norepl/skleeman/hail/hail/python/hail/backend/py4j_backend.py", line 98, in execute
    raise e
  File "/grid/wsbs/home_norepl/skleeman/hail/hail/python/hail/backend/py4j_backend.py", line 74, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/grid/wsbs/home_norepl/skleeman/hail/hail/python/hail/backend/py4j_backend.py", line 32, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 306 in stage 90.0 failed 1 times, most recent failure: Lost task 306.0 in stage 90.0 (TID 10329, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 1412031 ms
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 306 in stage 90.0 failed 1 times, most recent failure: Lost task 306.0 in stage 90.0 (TID 10329, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 1412031 ms

Can you share the JVM stack trace as well? That should be the bit after Driver stacktrace:

We can replicate this error with the following benchmark:
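A minimal sketch of the shape of that benchmark (synthetic genotypes from hl.balding_nichols_model; the dimensions are placeholders rather than the exact benchmark code):

import hail as hl

hl.init()

# Small synthetic dataset; the dimensions are placeholders for illustration.
mt = hl.balding_nichols_model(n_populations=3, n_samples=4000, n_variants=10000)
rel = hl.pc_relate(mt.GT, min_individual_maf=0.05, k=2, statistics='kin')
rel.count()  # force the pipeline to execute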

This works on my laptop (takes a minute or two). It either times out or throws the same error in docker (whether running in docker on batch or in a container on my laptop):

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 13, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 130453 ms
Driver stacktrace:

Copied below. We are considering running this on Google Cloud to get around this, but I think the job will cost at least $1,000.

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 306 in stage 90.0 failed 1 times, most recent failure: Lost task 306.0 in stage 90.0 (TID 10329, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 1412031 ms
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:176)
	at is.hail.utils.richUtils.RichContextRDD.writePartitions(RichContextRDD.scala:112)
	at is.hail.utils.richUtils.RichRDD$.writePartitions$extension(RichRDD.scala:204)
	at is.hail.linalg.BlockMatrix.write(BlockMatrix.scala:872)
	at is.hail.methods.PCRelate.writeRead(PCRelate.scala:159)
	at is.hail.methods.PCRelate.gram(PCRelate.scala:165)
	at is.hail.methods.PCRelate.phi(PCRelate.scala:227)
	at is.hail.methods.PCRelate.computeResult(PCRelate.scala:184)
	at is.hail.methods.PCRelate.execute(PCRelate.scala:146)
	at is.hail.expr.ir.BlockMatrixToTableApply.execute(TableIR.scala:2786)
	at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:1846)
	at is.hail.expr.ir.TableFilter.execute(TableIR.scala:1280)
	at is.hail.expr.ir.TableKeyBy.execute(TableIR.scala:1210)
	at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:1846)
	at is.hail.expr.ir.TableKeyBy.execute(TableIR.scala:1210)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:825)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:360)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:344)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:341)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
	at is.hail.utils.package$.using(package.scala:618)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:12)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:254)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:341)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:385)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:383)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:383)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)