LD pruning repeated errors

Hi,

Many thanks for making this amazing package available. I am having some difficulty with LD pruning. I am running this on our internal cluster and repeatedly getting errors along the lines of:

Error summary: SparkException: Job aborted due to stage failure: Task 42 in stage 14.0 failed 1 times, most recent failure: Lost task 42.0 in stage 14.0 (TID 846, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 122832 ms

I have tried a number of the changes suggested on these forums, e.g. repartitioning the data and changing the block size (a sketch of these mitigations follows the code below). I have copied the code I am using below. I am running on 48 cores, each with 10 GB of memory.

import hail as hl
import os
from gnomad.utils.liftover import *

# Define memory and CPU availability
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[48] --driver-memory 480g pyspark-shell"
tmp = '/mnt/grid/ukbiobank/data/ApplicationXXXXX/skleeman/tmp'
hl.init(default_reference='GRCh38', master='local[48]', local='local[48]',
        min_block_size=128, tmp_dir=tmp)

ukb = hl.read_matrix_table('/mnt/grid/ukbiobank/data/ApplicationXXXXXX/skleeman/ukb_grch38_filtered.mt')
print(ukb.count())

# Remove known high-LD regions (from the plinkQC repo)
intervals = hl.import_bed('/mnt/grid/janowitz/home/skleeman/ukbiobank/cancergwas/remove_ld_grch38.bed',
                          reference_genome='GRCh38')
ukb = ukb.filter_rows(hl.is_defined(intervals[ukb.locus]), keep=False)

# Prune to r2 = 0.1
pruned_ht = hl.ld_prune(ukb.GT, r2=0.1, memory_per_core=5000)
ukb = ukb.filter_rows(hl.is_defined(pruned_ht[ukb.row_key]))
print(ukb.count())

ukb.write('/mnt/grid/ukbiobank/data/ApplicationXXXXXX/skleeman/ukb_grch38_filtered_pruned.mt',
          overwrite=True)  # Save pruned MT
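For reference, the repartitioning and block-size mitigations mentioned above might look like the following. This is a minimal sketch; the partition count and block size are illustrative assumptions, not the exact values tried.

# Hypothetical sketch: spread the work over more, smaller partitions and
# pass a smaller block size to ld_prune. 2000 and 1024 are illustrative
# values, not the ones actually tried in this thread.
ukb = ukb.repartition(2000)
pruned_ht = hl.ld_prune(ukb.GT, r2=0.1, block_size=1024, memory_per_core=5000)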

I would be extremely grateful for your advice on this!

Kind regards,

Sam Kleeman

The next couple of days are quite busy so I won’t be able to give a good response for a bit, but I have a few questions:

  1. This is running Spark in local mode on a single server with 48 cores, right? (not on a cluster)
  2. Can you paste the full stack trace?

This is the same error message I’ve seen running linear algebra benchmarks in Docker, but I can’t reproduce it on my Mac. I’m quite baffled.

Hi,

Many thanks for getting back to me. I am running this on our institute’s cluster, which is managed with UGE, but I get the same error with 96 cores, in which case I have dedicated control of a single node.

I have copied the full trace below:

2020-12-06 20:20:09 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2020-12-06 20:20:10 WARN  Hail:37 - This Hail JAR was compiled for Spark 2.4.5, running with Spark 2.4.1.
  Compatibility is not guaranteed.
Running on Apache Spark version 2.4.1
SparkUI available at http://bam13.cm.cluster:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.60-de1845e1c2f6
LOGGING: writing to /mnt/grid/janowitz/home/skleeman/ukbiobank/cancergwas/hail-20201206-2020-0.2.60-de1845e1c2f6.log
INFO (gnomad.sample_qc.pipeline 147): Creating QC MatrixTable
WARNING (gnomad.sample_qc.pipeline 150): The LD-prune step of this function requires non-preemptible workers only!
[Stage 6:==>                                                  (152 + 47) / 3139]2020-12-06 20:20:17 Hail: INFO: ld_prune: running local pruning stage with max queue size of 64777 variants
2020-12-07 01:32:42 Hail: INFO: wrote table with 6296439 rows in 250 partitions to /mnt/grid/janowitz/home/skleeman/tmp/4AVLJslgE6FnIUQBHKdo8b
    Total size: 134.49 MiB
    * Rows: 134.49 MiB
    * Globals: 11.00 B
    * Smallest partition: 6862 rows (144.68 KiB)
    * Largest partition:  63302 rows (1.52 MiB)
2020-12-07 01:53:23 Hail: INFO: Wrote all 1538 blocks of 6296439 x 3942 matrix with block size 4096.
Traceback (most recent call last):
  File "filter_ref.py", line 35, in <module>
    filter_lcr=False, filter_decoy=False, filter_segdup=False, min_inbreeding_coeff_threshold = -0.25)
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/gnomad/sample_qc/pipeline.py", line 179, in get_qc_mt
    pruned_ht = hl.ld_prune(unfiltered_qc_mt.GT, r2=ld_r2)
  File "<decorator-gen-1723>", line 2, in ld_prune
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/methods/statgen.py", line 3018, in ld_prune
    entries.i, entries.j, keep=False, tie_breaker=tie_breaker, keyed=False)
  File "<decorator-gen-1375>", line 2, in maximal_independent_set
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/methods/misc.py", line 151, in maximal_independent_set
    edges.write(edges_path)
  File "<decorator-gen-1095>", line 2, in write
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/table.py", line 1271, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    raise e
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 74, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 32, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 193 in stage 6.0 failed 1 times, most recent failure: Lost task 193.0 in stage 6.0 (TID 2981, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 192875 ms
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 193 in stage 6.0 failed 1 times, most recent failure: Lost task 193.0 in stage 6.0 (TID 2981, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 192875 ms
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:166)
	at is.hail.utils.richUtils.RichContextRDD.writePartitions(RichContextRDD.scala:109)
	at is.hail.io.RichContextRDDLong$.writeRows$extension(RichContextRDDRegionValue.scala:224)
	at is.hail.rvd.RVD.write(RVD.scala:797)
	at is.hail.expr.ir.TableNativeWriter.apply(TableWriter.scala:102)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:825)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:354)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:338)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:335)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
	at is.hail.utils.package$.using(package.scala:618)
	at is.hail.annotations.Region$.scoped(Region.scala:18)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:247)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:335)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:379)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:377)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:377)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.60-de1845e1c2f6
Error summary: SparkException: Job aborted due to stage failure: Task 193 in stage 6.0 failed 1 times, most recent failure: Lost task 193.0 in stage 6.0 (TID 2981, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 192875 ms
Driver stacktrace:

Kind regards

Sam

It’s possible this is a native code issue. @samkleeman1, can you share the log file (/mnt/grid/janowitz/home/skleeman/ukbiobank/cancergwas/hail-20201206-2020-0.2.60-de1845e1c2f6.log)?

You might try compiling Hail from source on the machine in question.

Hi Dan,

Log file is available here - https://drive.google.com/file/d/1jhGy8hrQKircNXZueS6A2mn0xaUop8gu/view?usp=sharing

I will try to install from source - I assume the approach is to clone the GitHub repository and then run make install.

Kind regards,

Sam

I think the following should work:

$ git clone ...
$ cd hail
$ HAIL_COMPILE_NATIVES=1 make -C hail install

I have tried installing from source and am unfortunately seeing the same error. I am keen to do the LD pruning within Hail if possible; it seems somewhat circular to export to PLINK and then import again once the pruning is done.

Running on Apache Spark version 2.4.1
SparkUI available at http://bam13.cm.cluster:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-9d133c0e5186
LOGGING: writing to /mnt/grid/janowitz/home/skleeman/ukbiobank/cancergwas/hail-20201207-1345-0.2.61-9d133c0e5186.log
INFO (gnomad.sample_qc.pipeline 147): Creating QC MatrixTable
WARNING (gnomad.sample_qc.pipeline 150): The LD-prune step of this function requires non-preemptible workers only!
[Stage 6:===>                                                 (183 + 47) / 3139]2020-12-07 13:45:44 Hail: INFO: ld_prune: running local pruning stage with max queue size of 64777 variants
2020-12-07 19:20:29 Hail: INFO: wrote table with 6296439 rows in 250 partitions to /mnt/grid/janowitz/home/skleeman/tmp/wQgATT86AOB6sSJVa8MjEg
    Total size: 134.49 MiB
    * Rows: 134.49 MiB
    * Globals: 11.00 B
    * Smallest partition: 6862 rows (144.68 KiB)
    * Largest partition:  63302 rows (1.52 MiB)
2020-12-07 19:55:31 Hail: INFO: Wrote all 1538 blocks of 6296439 x 3942 matrix with block size 4096.
Traceback (most recent call last):
  File "filter_ref.py", line 35, in <module>
    filter_lcr=False, filter_decoy=False, filter_segdup=False, min_inbreeding_coeff_threshold = -0.25)
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/gnomad/sample_qc/pipeline.py", line 179, in get_qc_mt
    pruned_ht = hl.ld_prune(unfiltered_qc_mt.GT, r2=ld_r2)
  File "<decorator-gen-1725>", line 2, in ld_prune
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/methods/statgen.py", line 3022, in ld_prune
    entries.i, entries.j, keep=False, tie_breaker=tie_breaker, keyed=False)
  File "<decorator-gen-1377>", line 2, in maximal_independent_set
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/methods/misc.py", line 151, in maximal_independent_set
    edges.write(edges_path)
  File "<decorator-gen-1095>", line 2, in write
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/table.py", line 1271, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    raise e
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 74, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/grid/wsbs/home_norepl/skleeman/.local/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 32, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 202 in stage 6.0 failed 1 times, most recent failure: Lost task 202.0 in stage 6.0 (TID 2990, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 163114 ms
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 202 in stage 6.0 failed 1 times, most recent failure: Lost task 202.0 in stage 6.0 (TID 2990, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 163114 ms
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:166)
	at is.hail.utils.richUtils.RichContextRDD.writePartitions(RichContextRDD.scala:109)
	at is.hail.io.RichContextRDDLong$.writeRows$extension(RichContextRDDRegionValue.scala:224)
	at is.hail.rvd.RVD.write(RVD.scala:797)
	at is.hail.expr.ir.TableNativeWriter.apply(TableWriter.scala:102)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:825)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:354)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:338)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:335)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
	at is.hail.utils.package$.using(package.scala:618)
	at is.hail.annotations.Region$.scoped(Region.scala:18)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:247)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:335)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:379)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:377)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:377)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.61-9d133c0e5186
Error summary: SparkException: Job aborted due to stage failure: Task 202 in stage 6.0 failed 1 times, most recent failure: Lost task 202.0 in stage 6.0 (TID 2990, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 163114 ms

Actually, a clarification: I was not able to compile from source due to a missing lz4.h, so I only ran make install without the parameter HAIL_COMPILE_NATIVES=1. I have since installed lz4 locally (on my path, as I do not have root privileges), but the Hail build is not finding it.

For some reason, a Spark executor (a worker) goes silent for two minutes, is considered lost by the leader, which then promptly does nothing for a few minutes before failing.
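One mitigation sometimes suggested for this failure mode, though not tried in this thread, is to raise Spark’s heartbeat and network timeouts when initializing Hail. A minimal sketch, assuming the local-mode setup above (the specific timeout values are assumptions):

import hail as hl

# Hypothetical sketch: give long-running tasks more headroom before the
# driver declares the executor lost. heartbeatInterval must stay well
# below spark.network.timeout.
hl.init(default_reference='GRCh38', master='local[48]',
        spark_conf={'spark.executor.heartbeatInterval': '60s',
                    'spark.network.timeout': '600s'})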

ld_prune is a bit fickle. My first thought is: you seem to have a truly massive table of 6,296,439 variants. How many variants are you sending into ld_prune? Can you try using an allele frequency filter before that (see the sketch below)? I don’t know what you plan to do with the LD-pruned variants, but I recall folks normally sending closer to a couple thousand variants into ld_prune. That should also dramatically improve the run time.
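A minimal sketch of the allele frequency filter suggested above, assuming a biallelic MatrixTable named ukb (the 1% cutoff is an illustrative assumption):

# Hypothetical sketch: keep only common variants before pruning.
ukb = hl.variant_qc(ukb)
ukb = ukb.filter_rows(ukb.variant_qc.AF[1] > 0.01)
pruned_ht = hl.ld_prune(ukb.GT, r2=0.1)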

Yes, I am running LD prune on the gnomAD 1000G/HGDP dataset, and this set is already filtered by allele frequency. Based on the gnomAD code in their QC GitHub repo, they ran this exact command successfully to perform LD pruning on this cohort.

Ah, then that’s definitely what is wrong. HAIL_COMPILE_NATIVES=1 is what actually builds the system-specific libraries. If you can’t install lz4 in the usual place, you need to tell Hail where to find those files. The way to do this is to set CXXFLAGS to include -Ipath/to/lz4/directory/, like this:

CXXFLAGS='-Ipath/to/lz4/directory' HAIL_COMPILE_NATIVES=1 make install-editable

Unfortunately this command is not working, even if I modify it to:

CXXFLAGS='-Ipath/to/lz4/directory' HAIL_COMPILE_NATIVES=1 make -C hail install-editable

lz4 is installed across three directories (bin, include and lib).

I get the error: *** No rule to make target 'lz4.h', needed by 'build/Decoder.o'. Stop.

Run make clean once, then re-run. The Hail build is stuck on stale configuration from when lz4.h did not exist.

Following make clean, this command gets a little way through:
CXXFLAGS='/grid/wsbs/home_norepl/skleeman/Software/LocalInstall/usr/local/include/lz4.h' HAIL_COMPILE_NATIVES=1 make -C hail install-editable

Then I get:

    -ggdb -fno-strict-aliasing -I../resources/include -I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-0.el8_2.x86_64/include -I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-0.el8_2.x86_64/include/linux ApproximateQuantiles_test.cpp -MG -M -MF build/ApproximateQuantiles_test.d -MT build/ApproximateQuantiles_test.o

    g++ -o build/NativeBoot.o /grid/wsbs/home_norepl/skleeman/Software/LocalInstall/usr/local/include/lz4.h -march=sandybridge -O3 -std=c++14 -Ilibsimdpp-2.1 -Wall -Wextra -fPIC -ggdb -fno-strict-aliasing -I../resources/include -I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-0.el8_2.x86_64/include -I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-0.el8_2.x86_64/include/linux -MD -MF build/NativeBoot.d -MT build/NativeBoot.o -c NativeBoot.cpp

    g++: fatal error: cannot specify ‘-o’ with ‘-c’, ‘-S’ or ‘-E’ with multiple files

    compilation terminated.

    make[1]: *** [Makefile:65: build/NativeBoot.o] Error 1

    make[1]: Leaving directory '/grid/wsbs/home_norepl/skleeman/hail/hail/src/main/c'

    make: *** [Makefile:355: native-lib-prebuilt] Error 2

    make: Leaving directory '/grid/wsbs/home_norepl/skleeman/hail/hail'

You need to use CXXFLAGS=-Ipath/to/lz4, note the -I. Try this:

CXXFLAGS='-I/grid/wsbs/home_norepl/skleeman/Software/LocalInstall/usr/local/include/lz4.h' HAIL_COMPILE_NATIVES=1 make -C hail install-editable

Incredible, that worked! For the benefit of anyone reading this in the future, the command that worked was:

CXXFLAGS='-I /grid/wsbs/home_norepl/skleeman/Software/LocalInstall/usr/local/include/' HAIL_COMPILE_NATIVES=1 make -C hail install-editable

I will run the LD pruning script again and let you know!
Thanks so much for the help!!! :grinning:


To follow up on this: I am still getting the error mentioned above, albeit somewhat inconsistently; an example is copied below. I have tried modifying most of the ld_prune parameters (including block size, memory per core and total memory). I will have to use PLINK for this step, which is a little unfortunate (a sketch of that round trip follows the error below).

Error summary: SparkException: Job aborted due to stage failure: Task 45 in stage 14.0 failed 1 times, most recent failure: Lost task 45.0 in stage 14.0 (TID 855, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 130105 ms
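For anyone taking the PLINK route mentioned above, a minimal sketch of the round trip, with hypothetical paths and an illustrative plink command. It relies on export_plink writing colon-delimited variant IDs by default, which hl.variant_str reproduces.

import hail as hl

ukb = hl.read_matrix_table('/path/to/ukb_grch38_filtered.mt')  # hypothetical path

# Export genotypes for PLINK; uses the GT entry field by default.
hl.export_plink(ukb, '/tmp/ukb_for_pruning')

# Outside Hail, something like:
#   plink --bfile /tmp/ukb_for_pruning --indep-pairwise 50 5 0.1 --out /tmp/pruned

# Read back the kept variant IDs and filter the MatrixTable.
kept = hl.import_table('/tmp/pruned.prune.in', no_header=True, key='f0')
ukb = ukb.filter_rows(hl.is_defined(kept[hl.variant_str(ukb.locus, ukb.alleles)]))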