Hl.maximal_independent_set - job 'cancelled because SparkContext was shut down'

Yes, but I’ve also seen this issue appear in docker runtimes.

I should say that I have gotten the pc_relate function to work on a much smaller dataset (n=3,000 vs. n=500,000), so it seems to be a sample-size-specific problem

Sam

That’s a pretty substantial increase in complexity. pc_relate scales as n_samples * n_samples * n_variants, so with the same number of variants that’s roughly a (500,000 / 3,000)^2 ≈ 27,000-fold increase in asymptotic complexity. It’s possible there are data structures on the leader (driver) that become too large at that dataset size. I’d expect setting the driver memory really large to avoid that issue.
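For what it’s worth, here’s a minimal sketch of how one might raise the driver memory at initialization (assuming a local-mode Spark setup; the values are placeholders, not tuned recommendations):

import hail as hl

# Request a large driver JVM heap before the SparkContext is created.
# In local mode every task runs inside the driver process, so this is the limit that matters.
hl.init(spark_conf={
    'spark.driver.memory': '100g',       # placeholder; size this to your machine
    'spark.driver.maxResultSize': '0',   # 0 = don't cap results collected back to the driver
})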

How many variants are you using?

200,000 variants; I can try to reduce that a bit.

I’d try to go down an order of magnitude there. 200k common variants capture more relatedness information than you need for this.
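As a rough sketch of what that thinning could look like (assuming the genotype calls are in mt.GT; the MAF cutoff and sampling fraction are illustrative only):

import hail as hl

# mt is assumed to be the already-loaded MatrixTable of ~200k variants.
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.05)  # keep common variants (illustrative cutoff)
mt = mt.sample_rows(0.1, seed=42)                # drop roughly an order of magnitude of rows
rel = hl.pc_relate(mt.GT, min_individual_maf=0.05, k=10, statistics='kin')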

Hi,

I tried running with 93k variants but still no joy.

LOGGING: writing to /mnt/grid/janowitz/home/skleeman/ukbiobank/cancergwas/relatedness/hail-20210202-1621-0.2.61-291a63b97bd9.log
2021-02-02 16:22:25 Hail: INFO: hwe_normalized_pca: running PCA using 93511 variants.
2021-02-02 16:23:23 Hail: INFO: pca: running PCA with 10 components...
2021-02-02 19:20:17 Hail: INFO: Wrote all 2760 blocks of 93511 x 488377 matrix with block size 4096.
2021-02-02 19:35:42 Hail: INFO: wrote matrix with 11 rows and 93511 columns as 23 blocks of size 4096 to /mnt/grid/janowitz/rdata_norepl/tmp/pcrelate-write-read-FwxtMIQglFbyGXpk5IJj3o.bm
2021-02-02 20:04:44 Hail: INFO: wrote matrix with 93511 rows and 488377 columns as 2760 blocks of size 4096 to /mnt/grid/janowitz/rdata_norepl/tmp/pcrelate-write-read-uTVvarxvTeN2pZEd5Q4cTy.bm
Traceback (most recent call last):
  File "relate.py", line 19, in <module>
    relatedness_ht.write("/mnt/grid/ukbiobank/data/Application58510/skleeman/relatedness_ukb.ht", overwrite=True)
  File "<decorator-gen-1095>", line 2, in write
  File "/grid/wsbs/home_norepl/skleeman/hail/hail/python/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/grid/wsbs/home_norepl/skleeman/hail/hail/python/hail/table.py", line 1271, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/grid/wsbs/home_norepl/skleeman/hail/hail/python/hail/backend/py4j_backend.py", line 98, in execute
    raise e
  File "/grid/wsbs/home_norepl/skleeman/hail/hail/python/hail/backend/py4j_backend.py", line 74, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/grid/wsbs/home_norepl/skleeman/hail/hail/python/hail/backend/py4j_backend.py", line 32, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 306 in stage 90.0 failed 1 times, most recent failure: Lost task 306.0 in stage 90.0 (TID 10329, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 1412031 ms
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 306 in stage 90.0 failed 1 times, most recent failure: Lost task 306.0 in stage 90.0 (TID 10329, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 1412031 ms

Can you share the JVM stack trace as well? That should be the bit after Driver stacktrace:

We can replicate this error with the following benchmark:
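A minimal sketch of the shape of that benchmark (synthetic genotypes from hl.balding_nichols_model; the dimensions are placeholders rather than the exact benchmark code):

import hail as hl

hl.init()

# Small synthetic dataset; the dimensions are placeholders for illustration.
mt = hl.balding_nichols_model(n_populations=3, n_samples=4000, n_variants=10000)
rel = hl.pc_relate(mt.GT, min_individual_maf=0.05, k=2, statistics='kin')
rel.count()  # force the pipeline to execute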

This works on my laptop (takes a minute or two). It either times out or throws the same error in docker (whether running in docker on batch or in a container on my laptop):

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 13, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 130453 ms
Driver stacktrace:

Copied below. We are considering running this on Google Cloud to get around this, but I think the job will cost at least $1,000.

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 306 in stage 90.0 failed 1 times, most recent failure: Lost task 306.0 in stage 90.0 (TID 10329, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 1412031 ms
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:176)
	at is.hail.utils.richUtils.RichContextRDD.writePartitions(RichContextRDD.scala:112)
	at is.hail.utils.richUtils.RichRDD$.writePartitions$extension(RichRDD.scala:204)
	at is.hail.linalg.BlockMatrix.write(BlockMatrix.scala:872)
	at is.hail.methods.PCRelate.writeRead(PCRelate.scala:159)
	at is.hail.methods.PCRelate.gram(PCRelate.scala:165)
	at is.hail.methods.PCRelate.phi(PCRelate.scala:227)
	at is.hail.methods.PCRelate.computeResult(PCRelate.scala:184)
	at is.hail.methods.PCRelate.execute(PCRelate.scala:146)
	at is.hail.expr.ir.BlockMatrixToTableApply.execute(TableIR.scala:2786)
	at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:1846)
	at is.hail.expr.ir.TableFilter.execute(TableIR.scala:1280)
	at is.hail.expr.ir.TableKeyBy.execute(TableIR.scala:1210)
	at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:1846)
	at is.hail.expr.ir.TableKeyBy.execute(TableIR.scala:1210)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:825)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:360)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:344)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:341)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
	at is.hail.utils.package$.using(package.scala:618)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:12)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:254)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:341)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:385)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:383)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:383)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)