PCA job aborted from SparkException

Awesome. @tpoterba, let me know when it is patched! Since I am using Terra, would I need to ask them to update their runtime environments for the fix to propagate?

Yes, we’ll need to make a patch to update the Hail version in Terra once we release.

Fix is in flight here:

Thanks, I will follow along on GitHub.

@tpoterba it looks like this fix was merged, thanks for your help! Do you know if there is an approximate timeline for the next Hail release?

I am in communication with the Terra notebook engineers; they may have also reached out to coordinate updating their Hail install after the next release. Thanks!

We released yesterday and made a change request against Terra, so I’d expect it to be updated soon, if not already.

Still experiencing some challenges getting this to work. I am using Hail 0.2.45.

The good news: I can load in the UK Biobank BED files and no longer experience the import_plink() overflow error!
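
(For context, a minimal sketch of the kind of import call I'm using; the bucket paths here are placeholders rather than my actual UK Biobank file locations.)

import hail as hl

# placeholder paths, substitute the real .bed/.bim/.fam locations
mt = hl.import_plink(
    bed='gs://my-bucket/ukb_chr1.bed',
    bim='gs://my-bucket/ukb_chr1.bim',
    fam='gs://my-bucket/ukb.fam',
    reference_genome='GRCh37')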

The bad news is that even with the smaller number of variants from using the genotype data (and NOT the imputed genotype data), I still run into Spark memory errors. For reference, I am configured as a Spark cluster: my master node has 16 CPUs, 60 GB memory, and a 50 GB disk; I have 53 preemptible worker nodes with 4 CPUs, 26 GB memory, and a 50 GB disk each.

After variant filtering, I have 39628 samples and 581323 variants in my MatrixTable. Running hl.hwe_normalized_pca(), I get the following error:

2020-07-01 16:11:56 Hail: INFO: hwe_normalized_pca: running PCA using 559712 variants.

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-25-92b1fdebb591> in <module>
      1 # Compute PCA
----> 2 eigenvalues, pcs, _ = hl.hwe_normalized_pca(gen_filt.GT, k=10, compute_loadings=False)

<decorator-gen-1549> in hwe_normalized_pca(call_expr, k, compute_loadings)

/usr/local/lib/python3.7/dist-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/usr/local/lib/python3.7/dist-packages/hail/methods/statgen.py in hwe_normalized_pca(call_expr, k, compute_loadings)
   1593     return pca(normalized_gt,
   1594                k,
-> 1595                compute_loadings)
   1596 
   1597 

<decorator-gen-1551> in pca(entry_expr, k, compute_loadings)

/usr/local/lib/python3.7/dist-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/usr/local/lib/python3.7/dist-packages/hail/methods/statgen.py in pca(entry_expr, k, compute_loadings)
   1695         'entryField': field,
   1696         'k': k,
-> 1697         'computeLoadings': compute_loadings
   1698     })).persist())
   1699 

<decorator-gen-1095> in persist(self, storage_level)

/usr/local/lib/python3.7/dist-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/usr/local/lib/python3.7/dist-packages/hail/table.py in persist(self, storage_level)
   1834             Persisted table.
   1835         """
-> 1836         return Env.backend().persist_table(self, storage_level)
   1837 
   1838     def unpersist(self) -> 'Table':

/usr/local/lib/python3.7/dist-packages/hail/backend/spark_backend.py in persist_table(self, t, storage_level)
    313 
    314     def persist_table(self, t, storage_level):
--> 315         return Table._from_java(self._jbackend.pyPersistTable(storage_level, self._to_java_table_ir(t._tir)))
    316 
    317     def unpersist_table(self, t):

/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/local/lib/python3.7/dist-packages/hail/backend/spark_backend.py in deco(*args, **kwargs)
     39             raise FatalError('%s\n\nJava stack trace:\n%s\n'
     40                              'Hail version: %s\n'
---> 41                              'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
     42         except pyspark.sql.utils.CapturedException as e:
     43             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: SparkException: Job aborted due to stage failure: Task 9 in stage 3.0 failed 4 times, most recent failure: Lost task 9.3 in stage 3.0 (TID 94, saturn-4a455550-c59e-45ba-9489-b0809295d82c-w-50.c.mycompany-research-and-development.internal, executor 23): ExecutorLostFailure (executor 23 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  10.0 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 3.0 failed 4 times, most recent failure: Lost task 9.3 in stage 3.0 (TID 94, saturn-4a455550-c59e-45ba-9489-b0809295d82c-w-50.c.mycompany-research-and-development.internal, executor 23): ExecutorLostFailure (executor 23 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  10.0 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2111)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2049)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:166)
	at is.hail.rvd.RVD.countPerPartition(RVD.scala:744)
	at is.hail.expr.ir.MatrixValue.toRowMatrix(MatrixValue.scala:241)
	at is.hail.methods.PCA.execute(PCA.scala:33)
	at is.hail.expr.ir.functions.WrappedMatrixToTableFunction.execute(RelationalFunctions.scala:49)
	at is.hail.expr.ir.TableToTableApply.execute(TableIR.scala:2409)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:23)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:402)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:401)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:20)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:18)
	at is.hail.utils.package$.using(package.scala:601)
	at is.hail.annotations.Region$.scoped(Region.scala:18)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:18)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:229)
	at is.hail.backend.spark.SparkBackend.pyPersistTable(SparkBackend.scala:401)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.45-a45a43f21e83
Error summary: SparkException: Job aborted due to stage failure: Task 9 in stage 3.0 failed 4 times, most recent failure: Lost task 9.3 in stage 3.0 (TID 94, saturn-4a455550-c59e-45ba-9489-b0809295d82c-w-50.c.mycompany-research-and-development.internal, executor 23): ExecutorLostFailure (executor 23 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  10.0 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:

Next, I tried to reduce my MatrixTable further with hl.ld_prune(mt.GT, r2=0.2, bp_window_size=500000), and I got the following error when running it:

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-26-0e16bd0e7bab> in <module>
      1 # Try pruning variants in LD if too many
----> 2 gen_prune = hl.ld_prune(gen_filt.GT, r2=0.2, bp_window_size=500000)
      3 print('After LD pruning, %d samples and %d variants remain.' % (gen_filt.gen_prune(), gen_filt.gen_prune()))

<decorator-gen-1575> in ld_prune(call_expr, r2, bp_window_size, memory_per_core, keep_higher_maf, block_size)

/usr/local/lib/python3.7/dist-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/usr/local/lib/python3.7/dist-packages/hail/methods/statgen.py in ld_prune(call_expr, r2, bp_window_size, memory_per_core, keep_higher_maf, block_size)
   3544             (mt[field].n_alt_alleles() - mt.info.mean) * mt.info.centered_length_rec,
   3545             0.0),
-> 3546         block_size=block_size)
   3547     r2_bm = (std_gt_bm @ std_gt_bm.T) ** 2
   3548 

<decorator-gen-1427> in from_entry_expr(cls, entry_expr, mean_impute, center, normalize, axis, block_size)

/usr/local/lib/python3.7/dist-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/usr/local/lib/python3.7/dist-packages/hail/linalg/blockmatrix.py in from_entry_expr(cls, entry_expr, mean_impute, center, normalize, axis, block_size)
    407         path = new_temp_file()
    408         cls.write_from_entry_expr(entry_expr, path, overwrite=False, mean_impute=mean_impute,
--> 409                                   center=center, normalize=normalize, axis=axis, block_size=block_size)
    410         return cls.read(path)
    411 

<decorator-gen-1439> in write_from_entry_expr(entry_expr, path, overwrite, mean_impute, center, normalize, axis, block_size)

/usr/local/lib/python3.7/dist-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/usr/local/lib/python3.7/dist-packages/hail/linalg/blockmatrix.py in write_from_entry_expr(entry_expr, path, overwrite, mean_impute, center, normalize, axis, block_size)
    696             else:
    697                 field = Env.get_uid()
--> 698                 mt.select_entries(**{field: entry_expr})._write_block_matrix(path, overwrite, field, block_size)
    699         else:
    700             mt = mt.select_entries(__x=entry_expr).unfilter_entries()

/usr/local/lib/python3.7/dist-packages/hail/matrixtable.py in _write_block_matrix(self, path, overwrite, entry_field, block_size)
   4110              'overwrite': overwrite,
   4111              'entryField': entry_field,
-> 4112              'blockSize': block_size}))
   4113 
   4114     def _calculate_new_partitions(self, n_partitions):

/usr/local/lib/python3.7/dist-packages/hail/backend/spark_backend.py in execute(self, ir, timed)
    294         jir = self._to_java_value_ir(ir)
    295         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
--> 296         result = json.loads(self._jhc.backend().executeJSON(jir))
    297         value = ir.typ._from_json(result['value'])
    298         timings = result['timings']

/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/local/lib/python3.7/dist-packages/hail/backend/spark_backend.py in deco(*args, **kwargs)
     39             raise FatalError('%s\n\nJava stack trace:\n%s\n'
     40                              'Hail version: %s\n'
---> 41                              'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
     42         except pyspark.sql.utils.CapturedException as e:
     43             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 227, saturn-4a455550-c59e-45ba-9489-b0809295d82c-w-47.c.mycompany-research-and-development.internal, executor 49): ExecutorLostFailure (executor 49 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  10.0 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:

Java stack trace:
java.lang.RuntimeException: error while applying lowering 'InterpretNonCompilable'
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:26)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:18)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:18)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:317)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:304)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:303)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:20)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:18)
	at is.hail.utils.package$.using(package.scala:601)
	at is.hail.annotations.Region$.scoped(Region.scala:18)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:18)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:229)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:303)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:323)
	at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 227, saturn-4a455550-c59e-45ba-9489-b0809295d82c-w-47.c.mycompany-research-and-development.internal, executor 49): ExecutorLostFailure (executor 49 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  10.0 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2111)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2049)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:166)
	at is.hail.rvd.RVD.countPerPartition(RVD.scala:744)
	at is.hail.expr.ir.functions.MatrixWriteBlockMatrix.execute(MatrixWriteBlockMatrix.scala:25)
	at is.hail.expr.ir.functions.WrappedMatrixToValueFunction.execute(RelationalFunctions.scala:88)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:735)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:50)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:45)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:20)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:18)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:18)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:317)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:304)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:303)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:20)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:18)
	at is.hail.utils.package$.using(package.scala:601)
	at is.hail.annotations.Region$.scoped(Region.scala:18)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:18)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:229)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:303)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:323)
	at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)




Hail version: 0.2.45-a45a43f21e83
Error summary: SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 227, saturn-4a455550-c59e-45ba-9489-b0809295d82c-w-47.c.mycompany-research-and-development.internal, executor 49): ExecutorLostFailure (executor 49 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  10.0 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:

The most confusing part is that I keep getting the message “10.0 GB of 10 GB physical memory used” when my nodes have 26 GB of memory each.

Per this link: https://stackoverflow.com/questions/40781354/container-killed-by-yarn-for-exceeding-memory-limits-10-4-gb-of-10-4-gb-physic
I tried setting ‘yarn.nodemanager.vmem-check-enabled’: ‘false’ in hl.init(), which was accepted but did not change anything. I also have ‘spark.executor.cores’: ‘4’ in my init().

I’m hoping this can be resolved by tweaking my Spark configuration in the init() call a bit, but I don’t really understand how these variables affect “physical memory”.
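
(For reference, a minimal sketch of how I'm passing these settings, assuming they can be applied at hl.init() time; on a managed Terra/Dataproc cluster some of them may only take effect if set when the cluster is created. The property names are standard Spark/YARN keys, and the values are purely illustrative.)

import hail as hl

# illustrative values only; note the YARN limit in the error appears to be the
# per-container allocation (executor memory plus overhead), not the node's 26 GB
hl.init(spark_conf={
    'spark.executor.cores': '4',
    'spark.executor.memory': '8g',
    # the setting the error message itself suggests boosting:
    'spark.yarn.executor.memoryOverhead': '2g',
    # accepted by hl.init() but had no visible effect in my case:
    'yarn.nodemanager.vmem-check-enabled': 'false',
})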

I was able to boost the physical memory somewhat by setting ‘spark.executor.memory’: ‘18g’, which is about as high as I can get it even though my worker node memory is set to 26 GB. I am still getting a SparkException, though the error message now says I am using 20 GB of 20 GB physical memory (see below).

Is there anything I can do to get this to run? Perhaps ~600k variants is still too memory-intensive, but I can't really prune this table further since ld_prune() is also quite memory-intensive.
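
(For reference, a minimal sketch of the pruning workflow I'm attempting, following the documented pattern of filtering the MatrixTable to the variants that ld_prune keeps; gen_filt is my filtered MatrixTable from above.)

# assuming `import hail as hl` and the filtered MatrixTable `gen_filt` from above
pruned_variants = hl.ld_prune(gen_filt.GT, r2=0.2, bp_window_size=500000)
gen_prune = gen_filt.filter_rows(hl.is_defined(pruned_variants[gen_filt.row_key]))
print('After LD pruning, %d variants remain.' % gen_prune.count_rows())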

I appreciate all of your help on this!

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-25-e3aac58407be> in <module>
      1 # Compute PCA
----> 2 eigenvalues, pcs, _ = hl.hwe_normalized_pca(gen_filt.GT, k=10, compute_loadings=False)
      3 #eigenvalues, pcs, _ = hl.hwe_normalized_pca(gen_prune.GT, k=10, compute_loadings=False)

<decorator-gen-1549> in hwe_normalized_pca(call_expr, k, compute_loadings)

/usr/local/lib/python3.7/dist-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/usr/local/lib/python3.7/dist-packages/hail/methods/statgen.py in hwe_normalized_pca(call_expr, k, compute_loadings)
   1593     return pca(normalized_gt,
   1594                k,
-> 1595                compute_loadings)
   1596 
   1597 

<decorator-gen-1551> in pca(entry_expr, k, compute_loadings)

/usr/local/lib/python3.7/dist-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/usr/local/lib/python3.7/dist-packages/hail/methods/statgen.py in pca(entry_expr, k, compute_loadings)
   1695         'entryField': field,
   1696         'k': k,
-> 1697         'computeLoadings': compute_loadings
   1698     })).persist())
   1699 

<decorator-gen-1095> in persist(self, storage_level)

/usr/local/lib/python3.7/dist-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    612     def wrapper(__original_func, *args, **kwargs):
    613         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 614         return __original_func(*args_, **kwargs_)
    615 
    616     return wrapper

/usr/local/lib/python3.7/dist-packages/hail/table.py in persist(self, storage_level)
   1834             Persisted table.
   1835         """
-> 1836         return Env.backend().persist_table(self, storage_level)
   1837 
   1838     def unpersist(self) -> 'Table':

/usr/local/lib/python3.7/dist-packages/hail/backend/spark_backend.py in persist_table(self, t, storage_level)
    313 
    314     def persist_table(self, t, storage_level):
--> 315         return Table._from_java(self._jbackend.pyPersistTable(storage_level, self._to_java_table_ir(t._tir)))
    316 
    317     def unpersist_table(self, t):

/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/local/lib/python3.7/dist-packages/hail/backend/spark_backend.py in deco(*args, **kwargs)
     39             raise FatalError('%s\n\nJava stack trace:\n%s\n'
     40                              'Hail version: %s\n'
---> 41                              'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
     42         except pyspark.sql.utils.CapturedException as e:
     43             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: SparkException: Job aborted due to stage failure: Task 10 in stage 3.0 failed 4 times, most recent failure: Lost task 10.3 in stage 3.0 (TID 90, saturn-3aae50e8-e65e-4fe4-b240-b4d8c7ad9956-w-18.c.mycompany-research-and-development.internal, executor 22): ExecutorLostFailure (executor 22 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  20.0 GB of 20 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 3.0 failed 4 times, most recent failure: Lost task 10.3 in stage 3.0 (TID 90, saturn-3aae50e8-e65e-4fe4-b240-b4d8c7ad9956-w-18.c.mycompany-research-and-development.internal, executor 22): ExecutorLostFailure (executor 22 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  20.0 GB of 20 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2111)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2049)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:166)
	at is.hail.rvd.RVD.countPerPartition(RVD.scala:744)
	at is.hail.expr.ir.MatrixValue.toRowMatrix(MatrixValue.scala:241)
	at is.hail.methods.PCA.execute(PCA.scala:33)
	at is.hail.expr.ir.functions.WrappedMatrixToTableFunction.execute(RelationalFunctions.scala:49)
	at is.hail.expr.ir.TableToTableApply.execute(TableIR.scala:2409)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:23)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:402)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:401)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:20)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:18)
	at is.hail.utils.package$.using(package.scala:601)
	at is.hail.annotations.Region$.scoped(Region.scala:18)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:18)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:229)
	at is.hail.backend.spark.SparkBackend.pyPersistTable(SparkBackend.scala:401)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.45-a45a43f21e83
Error summary: SparkException: Job aborted due to stage failure: Task 10 in stage 3.0 failed 4 times, most recent failure: Lost task 10.3 in stage 3.0 (TID 90, saturn-3aae50e8-e65e-4fe4-b240-b4d8c7ad9956-w-18.c.mycompany-research-and-development.internal, executor 22): ExecutorLostFailure (executor 22 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  20.0 GB of 20 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:

Ah, I’m 99% sure this is fixed by https://github.com/hail-is/hail/pull/9009

We haven’t released since then, though. I’ll try to create a release this afternoon.

awesome!!

Is this a bugfix that specifically applies to the PCA function, or the prune function, or memory usage across the board?

This fixes a memory leak in infrastructure that was used in a few places – exportBGEN, ld_prune, PCA, and a couple other things that use block matrix / linear algebra.

Quite a bad bug :sweat:

Ah, we need to fix something else before we release. I’ll post updates here.

Sounds good, let me know when it is released!

The offices are closed today and we’re having a few issues with our CI system, so I think tomorrow morning is a good bet for a release.

Ah, I thought the change that fixed this bug was stacked on top of the current release, but it was actually in 0.2.47! I’m drafting the next release now, but you should be able to get going on 0.2.47.

So now I am using Hail 0.2.49, and it seems like hl.hwe_normalized_pca() may be working now, but I have a question about runtime.

Previously, before the memory leak was patched, the function would terminate soon after displaying “2020-07-14 01:36:49 Hail: INFO: hwe_normalized_pca: running PCA using 559712 variants.”

Now it is progressing to the next step: “2020-07-14 03:09:56 Hail: INFO: pca: running PCA with 10 components…”

The issue is the runtime. So far it has been running for over 24 hours; I started this run around 7pm on Monday 7/13. I am using a Terra notebook with 53 nodes. I debated stopping the run last night, but the Jupyter kernel still seems to be responsive.

Is this sort of runtime normal for 39628 samples and 581323 variants or do you think I should terminate and attempt to rerun?

Is there a way to run this function with more verbosity?

Really, nothing except the most massive pipelines (hundreds of TBs of data processed) should take a day on a cluster of that size.

PCA is quite sensitive to partitioning, though. The best place to get more information about this is either the Hail log or the Spark UI. Do you have the Hail log file? It will probably be big, but that’s the thing to look at first.
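
(As a rough sketch of what checking the partitioning looks like; the MatrixTable name and the target partition count below are illustrative, not a recommendation.)

# check how many partitions the MatrixTable has before running PCA
print(mt.n_partitions())

# repartitioning changes how much data each task holds in memory;
# ~1000 partitions here is purely illustrative
mt = mt.repartition(1000)
eigenvalues, scores, _ = hl.hwe_normalized_pca(mt.GT, k=10)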

How would I get the Hail log file?

You’re running this in a notebook in Terra, right? You can let the thing run for a few minutes, then interrupt that cell, and run:

hl.upload_log('gs://path/to/some/bucket/you/own')