“sparkContext was shut down” while running hail/pyspark on a large dataset

I am using Hail version 0.1 (build 0d9d9fa), calling import_vcf() and writing out a VDS for 10K samples (chr1-22, chrX, chrY; total size around 3.3 TB):
hc.import_vcf('gs://path/raw_data/*.vcf.bgz').write('gs://path/step1/merged.vds')
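
The submitted script (hail_import_vcf_28072020.py) is essentially just this call plus the usual Hail 0.1 HailContext setup:

# hail_import_vcf_28072020.py (Hail 0.1)
from hail import *
hc = HailContext()
hc.import_vcf('gs://path/raw_data/*.vcf.bgz').write('gs://path/step1/merged.vds')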

I use dataproc cluster:

gcloud dataproc clusters create hailcluster \
--region europe-west1 \
--zone europe-west1-d \
--master-machine-type n1-highmem-16 \
--master-boot-disk-size 200 \
--num-workers 6 \
--worker-machine-type n1-highmem-16 \
--worker-boot-disk-size 100 \
--num-worker-local-ssds 1 \
--num-secondary-workers 15 \
--image-version 1.1 \
--project myproject \
--max-idle pt10m \
--properties "spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,spark:spark.driver.memory=100g,spark:spark.driver.maxResultSize=60g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1" \
--initialization-actions gs://hail-common/hail-init.sh

gcloud dataproc jobs submit pyspark \
--cluster=hailcluster \
--files=gs://hail-common/builds/0.1/jars/hail-0.1-0d9d9fa-Spark-2.0.2.jar \
--py-files=gs://hail-common/builds/0.1/python/hail-0.1-0d9d9fa.zip \
--properties="spark.driver.extraClassPath=./hail-0.1-0d9d9fa-Spark-2.0.2.jar,spark.executor.extraClassPath=./hail-0.1-0d9d9fa-Spark-2.0.2.jar" \
gs://path/scripts/hail_import_vcf_28072020.py

Error log:
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version -0d9d9fa

[Stage 0:> (0 + 0) / 59933]

[Stage 0:========> (11196 + 2147) / 59933]hail: info: No multiallelics detected.

[Stage 0:=========> (11339 + 2192) / 59933]hail: info: Coerced sorted dataset

[Stage 0:==========> (12994 + 3358) / 59933]
[Stage 1:> (347 + -84) / 59933]

[Stage 1:> (347 + -84) / 59933][Stage 2:> (4717 + 893) / 59933]
Traceback (most recent call last):
File "/tmp/4bfea3c4d6d2410881ea3136dbff8041/hail_import_vcf_28072020.py", line 3, in <module>
hc.import_vcf('gs://path/raw_data/*.vcf.bgz').write('gs://path/step1/merged.vds')
File "", line 2, in write
File "/home/ec2-user/BuildAgent/work/179f3a9ad532f105/python/hail/java.py", line 112, in handle_py4j
hail.java.FatalError: SparkException: Job 2 cancelled because SparkContext was shut down

Java stack trace:
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:149)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:525)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
at is.hail.variant.VariantDatasetFunctions$.write$extension(VariantDataset.scala:734)
at is.hail.variant.VariantDatasetFunctions.write(VariantDataset.scala:704)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
org.apache.spark.SparkException: Job 2 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:818)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:816)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:816)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1685)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1604)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1781)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1290)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1780)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:108)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1906)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:525)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
at is.hail.variant.VariantDatasetFunctions$.write$extension(VariantDataset.scala:734)
at is.hail.variant.VariantDatasetFunctions.write(VariantDataset.scala:704)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

Hail version: -0d9d9fa
Error summary: SparkException: Job 2 cancelled because SparkContext was shut down

I guess it is an OOM exception. I have already increased spark:spark.driver.memory to 100g, but it still does not work. Maybe I need to upgrade the master node to n1-highmem-32 and set spark:spark.driver.memory=200g? Do I need to adjust other flags?
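
In other words, something like this (these are just the values I am considering, not tested), with all other flags as in the original command above:

gcloud dataproc clusters create hailcluster \
--master-machine-type n1-highmem-32 \
--properties "spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,spark:spark.driver.memory=200g,spark:spark.driver.maxResultSize=60g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1"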

Any recommendations are welcome. Thanks!

Hail 0.1 is no longer supported. Can you switch to the latest version of 0.2?

Hi Tim, thanks. For this task we plan to stay on Hail 0.1.
If you have any ideas about this error, I would really appreciate your help. I found a similar question on the forum, but it has no answer yet.

With the same settings, a run on just chr11-20 succeeded; that data is around 1/3 of my whole dataset.

Then I tried the whole dataset. I increased spark.driver.memory to 600g and spark.driver.maxResultSize to 180g, and switched to --master-machine-type n1-highmem-96 and --worker-machine-type n1-highmem-32, but it still fails with the same error in stage 2, after finishing about half of the work (30000 out of 60000 tasks).

Hi Shuang.

As much as we loved Hail 0.1, Hail 0.2 works a lot better and is the only version we currently support. What we would highly recommend is that you export all of your VDS files to VCF and import them into Hail 0.2 as MatrixTables. I apologize for the inconvenience, but that is one of the limitations of our support.

If you need help moving your files from Hail 0.1 to 0.2, I am happy to help!
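
Roughly, that conversion would look like this (just a sketch, with the paths from this thread as placeholders; the Hail 0.1 export and the Hail 0.2 import need to run as separate jobs, since the two versions cannot be used in the same job):

# Hail 0.1 job: export the existing VDS back to VCF
from hail import *
hc = HailContext()
hc.read('gs://path/step1/merged.vds').export_vcf('gs://path/step1/merged.vcf.bgz')

# Hail 0.2 job (on a Hail 0.2 cluster): import the VCF and write a MatrixTable
import hail as hl
hl.init()
mt = hl.import_vcf('gs://path/step1/merged.vcf.bgz',
                   reference_genome='GRCh37')  # assumption: adjust to your reference build
mt.write('gs://path/step2/merged.mt', overwrite=True)  # output path is a placeholder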