Hi,
Many thanks for getting back to me. I am running this on our institute's cluster, which is managed with UGE, but I get the same error when requesting 96 cores, in which case I have dedicated control of an entire node.
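For context, I initialise Hail in local mode on the node along these lines (the Spark settings shown are illustrative placeholders rather than my exact values):

import hail as hl

# Local-mode session on a single 96-core node.
# The memory and timeout values below are placeholders, not necessarily what I used;
# spark.driver.memory generally has to be set before the JVM starts
# (e.g. via PYSPARK_SUBMIT_ARGS) and is shown here only for completeness.
hl.init(
    local='local[96]',
    tmp_dir='/mnt/grid/janowitz/home/skleeman/tmp',
    spark_conf={
        'spark.driver.memory': '200g',
        'spark.executor.heartbeatInterval': '60s',
        'spark.network.timeout': '600s',
    },
)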
I have copied the full trace below:
2020-12-06 20:20:09 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2020-12-06 20:20:10 WARN Hail:37 - This Hail JAR was compiled for Spark 2.4.5, running with Spark 2.4.1.
Compatibility is not guaranteed.
Running on Apache Spark version 2.4.1
SparkUI available at http://bam13.cm.cluster:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.60-de1845e1c2f6
LOGGING: writing to /mnt/grid/janowitz/home/skleeman/ukbiobank/cancergwas/hail-20201206-2020-0.2.60-de1845e1c2f6.log
INFO (gnomad.sample_qc.pipeline 147): Creating QC MatrixTable
WARNING (gnomad.sample_qc.pipeline 150): The LD-prune step of this function requires non-preemptible workers only!
[Stage 6:==> (152 + 47) / 3139]
2020-12-06 20:20:17 Hail: INFO: ld_prune: running local pruning stage with max queue size of 64777 variants
2020-12-07 01:32:42 Hail: INFO: wrote table with 6296439 rows in 250 partitions to /mnt/grid/janowitz/home/skleeman/tmp/4AVLJslgE6FnIUQBHKdo8b
Total size: 134.49 MiB
* Rows: 134.49 MiB
* Globals: 11.00 B
* Smallest partition: 6862 rows (144.68 KiB)
* Largest partition: 63302 rows (1.52 MiB)
2020-12-07 01:53:23 Hail: INFO: Wrote all 1538 blocks of 6296439 x 3942 matrix with block size 4096.
Traceback (most recent call last):
File "filter_ref.py", line 35, in <module>
filter_lcr=False, filter_decoy=False, filter_segdup=False, min_inbreeding_coeff_threshold = -0.25)
File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/gnomad/sample_qc/pipeline.py", line 179, in get_qc_mt
pruned_ht = hl.ld_prune(unfiltered_qc_mt.GT, r2=ld_r2)
File "<decorator-gen-1723>", line 2, in ld_prune
File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
return __original_func(*args_, **kwargs_)
File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/methods/statgen.py", line 3018, in ld_prune
entries.i, entries.j, keep=False, tie_breaker=tie_breaker, keyed=False)
File "<decorator-gen-1375>", line 2, in maximal_independent_set
File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
return __original_func(*args_, **kwargs_)
File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/methods/misc.py", line 151, in maximal_independent_set
edges.write(edges_path)
File "<decorator-gen-1095>", line 2, in write
File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
return __original_func(*args_, **kwargs_)
File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/table.py", line 1271, in write
Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 98, in execute
raise e
File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 74, in execute
result = json.loads(self._jhc.backend().executeJSON(jir))
File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 32, in deco
'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 193 in stage 6.0 failed 1 times, most recent failure: Lost task 193.0 in stage 6.0 (TID 2981, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 192875 ms
Driver stacktrace:
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 193 in stage 6.0 failed 1 times, most recent failure: Lost task 193.0 in stage 6.0 (TID 2981, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 192875 ms
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:166)
at is.hail.utils.richUtils.RichContextRDD.writePartitions(RichContextRDD.scala:109)
at is.hail.io.RichContextRDDLong$.writeRows$extension(RichContextRDDRegionValue.scala:224)
at is.hail.rvd.RVD.write(RVD.scala:797)
at is.hail.expr.ir.TableNativeWriter.apply(TableWriter.scala:102)
at is.hail.expr.ir.Interpret$.run(Interpret.scala:825)
at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:354)
at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:338)
at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:335)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
at is.hail.utils.package$.using(package.scala:618)
at is.hail.annotations.Region$.scoped(Region.scala:18)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:247)
at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:335)
at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:379)
at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:377)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:377)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.2.60-de1845e1c2f6
Error summary: SparkException: Job aborted due to stage failure: Task 193 in stage 6.0 failed 1 times, most recent failure: Lost task 193.0 in stage 6.0 (TID 2981, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 192875 ms
Driver stacktrace:
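For reference, the call in filter_ref.py that triggers this is roughly the following (reconstructed from the traceback; the input path and the name mt are placeholders for my actual data):

import hail as hl
from gnomad.sample_qc.pipeline import get_qc_mt

# Placeholder path for the input MatrixTable.
mt = hl.read_matrix_table('ukb_genotypes.mt')

# get_qc_mt internally runs hl.ld_prune(unfiltered_qc_mt.GT, r2=ld_r2),
# which is the stage that aborts in the trace above.
qc_mt = get_qc_mt(
    mt,
    filter_lcr=False,
    filter_decoy=False,
    filter_segdup=False,
    min_inbreeding_coeff_threshold=-0.25,
)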
Kind regards,
Sam