PCA failed due to insufficient executor memory


#1

Hello,

My PCA test failed with the following error. The job is running on a Cloudera cluster using Hail devel-7beaca7. I had to use the dev version because our VCF doesn’t have the PL field, which v0.1 required in order to load GT properly. What would you suggest I do? Where do I boost spark.yarn.executor.memoryOverhead, as the error message suggests?

Thanks very much!
Jerry

****** Hail command and error msg below **********
pca=vds.pca('scores=sa.pca', loadings=None, eigenvalues=None, k=5, as_array=False)
2017-09-12 22:16:44 Hail: INFO: Running PCA with 5 components…
[Stage 0:> (0 + 128) / 140]
---------------------------------------------------------------------------
FatalError Traceback (most recent call last)
in ()
----> 1 pca=vds.pca('scores=sa.pca', loadings=None, eigenvalues=None, k=5, as_array=False)

in pca(self, scores, loadings, eigenvalues, k, as_array)

PATH/hail-devel-7beaca7/build/distributions/hail-python.zip/hail/java.pyc in handle_py4j(func, *args, **kwargs)
119 raise FatalError('%s\n\nJava stack trace:\n%s\n'
120 'Hail version: %s\n'
--> 121 'Error summary: %s' % (deepest, full, Env.hc().version, deepest))
122 except py4j.protocol.Py4JError as e:
123 if e.args[0].startswith('An error occurred while calling'):

FatalError: SparkException: Job aborted due to stage failure: Task 11 in stage 0.0 failed 4 times, most recent failure: Lost task 11.3 in stage 0.0 (TID 159, itmiprd1cdhd02.itmi.inova.org, executor 21): ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 1.7 GB of 1.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in stage 0.0 failed 4 times, most recent failure: Lost task 11.3 in stage 0.0 (TID 159, itmiprd1cdhd02.itmi.inova.org, executor 21): ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 1.7 GB of 1.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
at is.hail.stats.ToHWENormalizedIndexedRowMatrix$.apply(ComputeRRM.scala:104)
at is.hail.methods.SamplePCA$.variantsSvdAndScores(SamplePCA.scala:51)
at is.hail.methods.SamplePCA$.apply(SamplePCA.scala:31)
at is.hail.variant.VariantDatasetFunctions$.pca$extension(VariantDataset.scala:508)
at is.hail.variant.VariantDatasetFunctions.pca(VariantDataset.scala:495)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

Hail version: devel-7beaca7
Error summary: SparkException: Job aborted due to stage failure: Task 11 in stage 0.0 failed 4 times, most recent failure: Lost task 11.3 in stage 0.0 (TID 159, itmiprd1cdhd02.itmi.inova.org, executor 21): ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 1.7 GB of 1.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:


#2

Hi Jerry,
The YARN defaults aren’t great for a lot of the memory-intensive linear algebra routines we use. One way to improve things is to increase the number of cores per executor from 1 to 4, so that the per-JVM overhead per core drops by a factor of 4. You can do this by setting spark.executor.cores=4 in the Spark configuration.
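
In case it helps with the "where do I set this" part: Spark properties like these have to be in place before the SparkContext starts, typically as --conf flags to spark-submit or pyspark, or as entries in spark-defaults.conf. Below is a rough sketch in plain Python that just assembles the flags. The helper name spark_conf_args is made up for illustration, and the memoryOverhead value of 1024 (megabytes in Spark 2.x on YARN) is only an assumed example to show the syntax, not a tuned recommendation for your cluster.

```python
import shlex


def spark_conf_args(settings):
    """Turn a dict of Spark properties into spark-submit style --conf arguments.

    Hypothetical helper for illustration; it only builds strings and does not
    talk to Spark or Hail.
    """
    args = []
    for key, value in sorted(settings.items()):
        args.extend(["--conf", "{}={}".format(key, value)])
    return args


settings = {
    # Suggested above: 4 cores per executor instead of 1.
    "spark.executor.cores": 4,
    # Named in the error message; 1024 is an assumed example value.
    "spark.yarn.executor.memoryOverhead": 1024,
}

conf_args = spark_conf_args(settings)

# Print the flags in a form that can be pasted after `spark-submit` or `pyspark`.
print(" ".join(shlex.quote(a) for a in conf_args))
```

The printed flags go on the command line that launches your Hail session, ahead of your script and its arguments. If your cluster admin manages spark-defaults.conf, the same two key/value pairs can go there instead.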