I want to run a spark shell with the Hail JAR on Google Dataproc, but I get errors


#1

Hi,

I tried to run spark-shell on Dataproc so that I could interactively type Scala commands using Hail. First, I SSH’d to the master node of the cluster. Then, I downloaded the Hail JAR to hail.jar. I read the docs for spark-shell and it sounded like --jars added JARs to the class path on my executors and the driver, so I tried this:

spark-shell --jars './hail.jar'

When I tried to create a hail context:

import is.hail._
val hc = HailContext()

I got some errors saying that spark.sql.files.openCostInBytes needed to be set to 50G. I tried a few values before realizing that 50G meant 50 GiB, i.e. 53687091200 bytes. After this I tried:

spark-shell --jars './hail.jar' --conf='spark.sql.files.openCostInBytes=53687091200' --conf='spark.sql.files.maxPartitionBytes=53687091200'
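
For anyone checking the number: the byte value above is just 50 GiB written out, taking 1 GiB = 1024^3 bytes. A quick way to compute it in the shell is sketched below; gib_to_bytes is a made-up helper name for illustration, not something Spark or Hail provides.

```shell
# Hypothetical helper: convert a size in GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 50   # prints 53687091200
```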

With this invocation, I managed to create a hail context, but when I tried to run filterVariantsExpr:

val vds = hc.read("gs://danking/ALL.1KG.800K-1204-partitions.vds")
val filtered = vds.splitMulti().variantQC().filterVariantsExpr("va.qc.AF > 0.05 && va.qc.AF < 0.95")
val filteredPruned = filtered.ldPrune()

I got a cryptic error about class loading:

[Stage 3:>                                                      (0 + 87) / 1024]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 77 in stage 3.0 failed 4 times, most recent failure: Lost task 77.3 in stage 3.0 (TID 2264, dan-1-sw-p3sn.c.broad-ctsa.internal): java.lang.RuntimeException: Failed to define or load class, check logs for the exception from defineClass.
	at is.hail.asm4s.package$HailClassLoader$.liftedTree1$1(package.scala:277)
	at is.hail.asm4s.package$HailClassLoader$.loadOrDefineClass(package.scala:253)
	at is.hail.asm4s.package$.loadClass(package.scala:287)
	at is.hail.asm4s.FunctionBuilder$$anon$2.apply(FunctionBuilder.scala:218)
	at is.hail.expr.CM$$anonfun$runWithDelayedValues$1.apply(CM.scala:74)
	at is.hail.expr.CM$$anonfun$runWithDelayedValues$1.apply(CM.scala:72)
	at is.hail.expr.Parser$$anonfun$is$hail$expr$Parser$$evalNoTypeCheck$1.apply(Parser.scala:53)
	at is.hail.expr.Parser$$anonfun$parseTypedExpr$1.apply(Parser.scala:81)
	at is.hail.variant.VariantSampleMatrix$$anonfun$88.apply(VariantSampleMatrix.scala:1036)
	at is.hail.variant.VariantSampleMatrix$$anonfun$88.apply(VariantSampleMatrix.scala:1032)
	at is.hail.variant.VariantSampleMatrix$$anonfun$filterVariants$1.apply(VariantSampleMatrix.scala:941)
	at is.hail.variant.VariantSampleMatrix$$anonfun$filterVariants$1.apply(VariantSampleMatrix.scala:941)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
	at is.hail.sparkextras.OrderedRDD$$anonfun$apply$7$$anon$2.hasNext(OrderedRDD.scala:211)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1763)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: is.hail.codegen.generated.C0
	at java.lang.ClassLoader.findClass(ClassLoader.java:530)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at is.hail.asm4s.package$HailClassLoader$.liftedTree1$1(package.scala:259)
	... 25 more

I’m not sure how to resolve this!


#2

Hi there,

Yeah, this is pretty confusing. spark-shell --help says:

  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.

In practice, though, --jars on its own did not put the JAR on the driver and executor class paths here. You also have to add it explicitly with the spark.driver.extraClassPath and spark.executor.extraClassPath properties:

spark-shell --jars './hail.jar' \
  --conf='spark.sql.files.openCostInBytes=53687091200' \
  --conf='spark.sql.files.maxPartitionBytes=53687091200' \
  --conf='spark.driver.extraClassPath=./hail.jar' \
  --conf='spark.executor.extraClassPath=./hail.jar'

NB: The JARs passed to --jars are copied to the working directory of each executor, but they are not copied to the working directory of the driver. So spark.driver.extraClassPath will usually be the same path you passed to --jars, whereas spark.executor.extraClassPath must be a path relative to the executor's working directory (./hail.jar in the command above).
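
If the JAR lives somewhere other than the current directory, the two class path values differ. Here is a rough sketch of how they could be derived from a single JAR path, following the note above. hail_shell_cmd is a hypothetical helper, not part of Spark or Hail, and it only prints the command rather than running it. It assumes the executor-side copy keeps the JAR's file name.

```shell
# Hypothetical helper (not part of Spark or Hail): prints, but does not run,
# a spark-shell command line for the JAR path given as $1.
hail_shell_cmd() {
  jar="$1"
  # Executors get a copy of the JAR in their working directory, so they
  # refer to it by a relative path built from the file name alone
  # (assumption: the copy keeps the original file name).
  executor_jar="./$(basename "$jar")"
  printf '%s\n' "spark-shell --jars '$jar' --conf='spark.sql.files.openCostInBytes=53687091200' --conf='spark.sql.files.maxPartitionBytes=53687091200' --conf='spark.driver.extraClassPath=$jar' --conf='spark.executor.extraClassPath=$executor_jar'"
}

# Example: print the command for a hail.jar in the current directory.
hail_shell_cmd "$PWD/hail.jar"
```

The driver entry reuses the path given to --jars unchanged, while the executor entry is reduced to ./ plus the file name.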


The same class path problem sometimes manifests as one of these errors instead:

ClassNotFoundException: is.hail.utils.SerializableHadoopConfiguration

ClassNotFoundException: is.hail.asm4s.AsmFunction2