Hi,

I tried to run `spark-shell` on Dataproc so that I could interactively type Scala commands using Hail. First, I SSH'd to the master node of the cluster. Then, I downloaded the Hail JAR to `hail.jar`. I read the docs for `spark-shell`, and it sounded like `--jars` adds JARs to the class path of both the executors and the driver, so I tried this:

spark-shell --jars './hail.jar'
When I tried to create a Hail context:

import is.hail._
val hc = HailContext()

I got some errors about `spark.sql.files.openCostInBytes` needing to be set to `50G`. I tried a few values before realizing that `50G` means 50 GiB. After this I tried:

spark-shell --jars './hail.jar' --conf='spark.sql.files.openCostInBytes=53687091200' --conf='spark.sql.files.maxPartitionBytes=53687091200'
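(For reference, the big number above is just `50G` expanded: Spark's size suffixes are binary, so `G` means GiB, which is how I eventually arrived at it.)

```scala
// Spark size suffixes are binary: 50G = 50 GiB = 50 * 1024^3 bytes
val openCost = 50L * 1024 * 1024 * 1024
// openCost: Long = 53687091200
```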
With this invocation, I managed to create a Hail context, but then I tried to run `filterVariantsExpr`:

val vds = hc.read("gs://danking/ALL.1KG.800K-1204-partitions.vds")
val filtered = vds.splitMulti().variantQC().filterVariantsExpr("va.qc.AF > 0.05 && va.qc.AF < 0.95")
val filteredPruned = filtered.ldPrune()
I got a cryptic error about class loading:
[Stage 3:> (0 + 87) / 1024]org.apache.spark.SparkException: Job aborted due to stage failure: Task 77 in stage 3.0 failed 4 times, most recent failure: Lost task 77.3 in stage 3.0 (TID 2264, dan-1-sw-p3sn.c.broad-ctsa.internal): java.lang.RuntimeException: Failed to define or load class, check logs for the exception from defineClass.
at is.hail.asm4s.package$HailClassLoader$.liftedTree1$1(package.scala:277)
at is.hail.asm4s.package$HailClassLoader$.loadOrDefineClass(package.scala:253)
at is.hail.asm4s.package$.loadClass(package.scala:287)
at is.hail.asm4s.FunctionBuilder$$anon$2.apply(FunctionBuilder.scala:218)
at is.hail.expr.CM$$anonfun$runWithDelayedValues$1.apply(CM.scala:74)
at is.hail.expr.CM$$anonfun$runWithDelayedValues$1.apply(CM.scala:72)
at is.hail.expr.Parser$$anonfun$is$hail$expr$Parser$$evalNoTypeCheck$1.apply(Parser.scala:53)
at is.hail.expr.Parser$$anonfun$parseTypedExpr$1.apply(Parser.scala:81)
at is.hail.variant.VariantSampleMatrix$$anonfun$88.apply(VariantSampleMatrix.scala:1036)
at is.hail.variant.VariantSampleMatrix$$anonfun$88.apply(VariantSampleMatrix.scala:1032)
at is.hail.variant.VariantSampleMatrix$$anonfun$filterVariants$1.apply(VariantSampleMatrix.scala:941)
at is.hail.variant.VariantSampleMatrix$$anonfun$filterVariants$1.apply(VariantSampleMatrix.scala:941)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
at is.hail.sparkextras.OrderedRDD$$anonfun$apply$7$$anon$2.hasNext(OrderedRDD.scala:211)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1763)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: is.hail.codegen.generated.C0
at java.lang.ClassLoader.findClass(ClassLoader.java:530)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at is.hail.asm4s.package$HailClassLoader$.liftedTree1$1(package.scala:259)
... 25 more
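For what it's worth, I tried to narrow things down with a small class-visibility probe in the shell (my own helper, not anything from the Hail API; I'm showing it with a stdlib class plus the generated name from the trace). My understanding is that `is.hail.codegen.generated.C0` is defined at runtime by Hail's own class loader, so the second check failing on the driver may be expected, but I can't tell whether it's also the cause of the executor failures:

```scala
// Probe whether a class name resolves through a given loader.
def visible(name: String,
            loader: ClassLoader = Thread.currentThread.getContextClassLoader): Boolean =
  try { loader.loadClass(name); true }
  catch { case _: ClassNotFoundException => false }

visible("scala.Option")                  // true: on every Scala classpath
visible("is.hail.codegen.generated.C0")  // false: defined only by Hail's runtime loader
```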
I’m not sure how to resolve this!