PCA failed due to not enough executor memory

Hello,

My PCA test failed with the error below. The job is running on a Cloudera cluster using Hail devel-7beaca7; I had to use the dev version because our VCF doesn't have the PL field, which v0.1 required to load GT properly. What would you suggest I do? Where do I boost spark.yarn.executor.memoryOverhead, as the error message suggests?

Thanks very much!
Jerry

****** Hail command and error msg below **********
pca=vds.pca('scores=sa.pca', loadings=None, eigenvalues=None, k=5, as_array=False)
2017-09-12 22:16:44 Hail: INFO: Running PCA with 5 components…
[Stage 0:> (0 + 128) / 140]---------------------------------------------------------------------------
FatalError Traceback (most recent call last)
in ()
----> 1 pca=vds.pca('scores=sa.pca', loadings=None, eigenvalues=None, k=5, as_array=False)

in pca(self, scores, loadings, eigenvalues, k, as_array)

PATH/hail-devel-7beaca7/build/distributions/hail-python.zip/hail/java.pyc in handle_py4j(func, *args, **kwargs)
119 raise FatalError('%s\n\nJava stack trace:\n%s\n'
120 'Hail version: %s\n'
--> 121 'Error summary: %s' % (deepest, full, Env.hc().version, deepest))
122 except py4j.protocol.Py4JError as e:
123 if e.args[0].startswith('An error occurred while calling'):

FatalError: SparkException: Job aborted due to stage failure: Task 11 in stage 0.0 failed 4 times, most recent failure: Lost task 11.3 in stage 0.0 (TID 159, itmiprd1cdhd02.itmi.inova.org, executor 21): ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 1.7 GB of 1.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in stage 0.0 failed 4 times, most recent failure: Lost task 11.3 in stage 0.0 (TID 159, itmiprd1cdhd02.itmi.inova.org, executor 21): ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 1.7 GB of 1.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
at is.hail.stats.ToHWENormalizedIndexedRowMatrix$.apply(ComputeRRM.scala:104)
at is.hail.methods.SamplePCA$.variantsSvdAndScores(SamplePCA.scala:51)
at is.hail.methods.SamplePCA$.apply(SamplePCA.scala:31)
at is.hail.variant.VariantDatasetFunctions$.pca$extension(VariantDataset.scala:508)
at is.hail.variant.VariantDatasetFunctions.pca(VariantDataset.scala:495)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

Hail version: devel-7beaca7
Error summary: SparkException: Job aborted due to stage failure: Task 11 in stage 0.0 failed 4 times, most recent failure: Lost task 11.3 in stage 0.0 (TID 159, itmiprd1cdhd02.itmi.inova.org, executor 21): ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 1.7 GB of 1.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:

Hi Jerry,
The YARN defaults aren't well suited to the memory-intensive linear algebra routines we use. One way to improve things is to increase the number of cores per executor from 1 to 4, so that the fixed per-JVM overhead is shared across four tasks instead of one. You can do this by setting spark.executor.cores=4 in your Spark configuration.
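For reference, here is a rough sketch of how you might set both spark.executor.cores and the spark.yarn.executor.memoryOverhead setting the error message mentions when launching Hail on a YARN cluster from Python. This assumes a Hail 0.1-era build driven from a pyspark/Jupyter session where HailContext accepts an existing SparkContext; the memoryOverhead value is purely illustrative:

from pyspark import SparkConf, SparkContext
from hail import HailContext

# The config must be set before the SparkContext is created.
conf = (SparkConf()
        .set('spark.executor.cores', '4')                    # share one JVM's overhead across 4 tasks
        .set('spark.yarn.executor.memoryOverhead', '4096'))  # off-heap headroom per executor, in MB (illustrative)
sc = SparkContext(conf=conf)
hc = HailContext(sc)

The same settings can also be passed on the command line to spark-submit / pyspark via --conf spark.executor.cores=4 --conf spark.yarn.executor.memoryOverhead=4096.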

@tpoterba I am having the same error as the OP, and I would like to increase the number of executor cores, but I am using Hail in Terra.

I know there is some collaboration between Hail and Terra with regard to a Terra Base Image (which I am currently using).

Do you know if there is a custom image I can use in Terra that includes Hail and Spark with the executor.cores parameter changed as you describe? Is it simple enough to change this parameter in a Dockerfile if I were to make a custom image for my Terra notebook?

You should be able to pass this parameter during Hail initialization:

import hail as hl
hl.init(spark_conf={'spark.executor.cores': '4'})
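If you want to confirm the setting took effect, you can read it back from the underlying SparkContext after hl.init (a small sketch; assumes hl.spark_context() is available, as in Hail 0.2):

print(hl.spark_context().getConf().get('spark.executor.cores'))  # should print '4'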

Thanks @tpoterba. I later figured that out and was able to increase the number of executor cores, but unfortunately I am still getting SparkExceptions on this PCA run.

I have created a new line of inquiry at PCA job aborted from SparkException so as not to sidetrack this older thread.

Do you think the issue is in the Spark configuration or in the cluster resources? I am currently running with 24 nodes (12 of which are preemptible), but I am experimenting with increasing compute resources even further.

@tpoterba any updates about this?

We're actively discussing this here: PCA job aborted from SparkException

Gosh! My apologies for not reading this thread as well as I should have! Thanks, Tim! :slight_smile: