Java error when loading VCF

I am trying to load a 1000 Genomes VCF and I get this error. It seems like it is trying to use Spark 1? Just guessing.

Hail version: 0.1-f57188f
Error summary: ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

What version of Spark is installed on your cluster? Did you supply -Dspark.version when compiling Hail? If not, Hail is compiled for Spark 2.0.2. Hail requires Spark 2.0.2 or later. If you’re using a version of Spark other than 2.0.2 (or a Cloudera build of Spark), you must specify it at compile time, for example:

./gradlew -Dspark.version=2.1.1 clean shadowJar

For Cloudera clusters, you need something like:

./gradlew -Dspark.version=2.0.2.cloudera clean shadowJar

Also, if you have multiple Spark installs, ensure you’re using spark2-submit/pyspark2, not spark-submit/pyspark.
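If it helps, here is a minimal Python sketch of the version check described above. The helper names (base_version, spark_version_problems) are hypothetical and not part of Hail; the only facts it relies on are the 2.0.2 minimum and the 2.0.2 default build target mentioned in this post. It assumes you obtain the running version yourself, for example from sc.version in a pyspark shell.

```python
def base_version(version):
    """Return the leading numeric components of a version string as a tuple.

    "2.1.0.cloudera1" -> (2, 1, 0)
    """
    numbers = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at vendor suffixes such as "cloudera1"
        numbers.append(int(piece))
    return tuple(numbers)


def spark_version_problems(running, compiled_for="2.0.2"):
    """Return a list of human-readable problems; an empty list means none found.

    `running` is the cluster's Spark version string (e.g. sc.version).
    `compiled_for` is the value passed as -Dspark.version when building Hail
    (assumed to be 2.0.2 if the flag was not supplied).
    """
    problems = []
    if base_version(running) < (2, 0, 2):
        problems.append("Spark {0} is older than 2.0.2".format(running))
    if running != compiled_for:
        problems.append(
            "Hail was built for Spark {0} but the cluster runs {1}; "
            "rebuild with a matching -Dspark.version".format(compiled_for, running)
        )
    return problems


# In a pyspark shell you might call:
#   for problem in spark_version_problems(sc.version):
#       print(problem)
```

The exact-string comparison is deliberately strict, since vendor builds are expected to need their own -Dspark.version value.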

Let me know if none of this fixes your issue!

@danking I think you added a check for that in HailContext construction.

I vaguely remember seeing this error when the jar is visible on the driver node, but not the executors.

I am trying to run this from an edge node that only has the gateway installed.

Can you check if the file in SPARK_CLASSPATH can be ls-ed from the worker machines? I suspect it’s not in a network-visible file system, which has caused this error in the past.
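One way to check this without logging in to each worker is to let Spark run the existence test for you. The following is only a sketch: probe_path and hosts_missing_path are hypothetical helper names, and it assumes an existing SparkContext (sc in a pyspark shell). Spark does not guarantee that tasks land on every worker, so an empty result is suggestive rather than conclusive.

```python
import os
import socket


def probe_path(path):
    """Report (hostname, exists) for `path` on whichever machine runs this call."""
    return (socket.gethostname(), os.path.exists(path))


def hosts_missing_path(sc, path, num_tasks=100):
    """Run `probe_path` in `num_tasks` Spark tasks; return hosts where `path` is absent.

    `sc` is an existing SparkContext. Using many small tasks raises the chance
    that every executor is sampled, but does not guarantee it.
    """
    results = (
        sc.parallelize(range(num_tasks), num_tasks)
        .map(lambda _: probe_path(path))
        .collect()
    )
    return sorted({host for host, exists in results if not exists})


# In a pyspark shell you might call:
#   print(hosts_missing_path(sc, os.environ["SPARK_CLASSPATH"]))
```

If the returned list is non-empty, the jar is not visible on those hosts and needs to be copied there or placed on a shared file system.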

I’ve seen this happen before when running python instead of pyspark/pyspark2.

I clearly have something misconfigured. I guess I don’t understand what configuration the Spark worker nodes require.

When I run pyspark like this I get this response:
[kmlong@lsa12-dn0 ~]$ pyspark2 --jars $HAIL_HOME/build/libs/hail-all-spark.jar --py-files $HAIL_HOME/python/ --conf,, --conf spark.sql.files.openCostInBytes=1099511627776 --conf spark.sql.files.maxPartitionBytes=1099511627776 --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=1099511627776 --conf spark.hadoop.parquet.block.size=1099511627776
WARNING: User-defined SPARK_HOME (/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2) overrides detected (/opt/cloudera/parcels/SPARK2/lib/spark2).
WARNING: Running pyspark from user-defined location.
Python 2.7.13 (default, Jul 14 2017, 11:59:49)
[GCC 5.3.1 20160406 (Red Hat 5.3.1-6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/07/17 16:04:01 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to '/opt/hail/build/libs/hail-all-spark.jar').
This is deprecated in Spark 1.0+.

Please instead use:

  • ./spark-submit with --driver-class-path to augment the driver classpath
  • spark.executor.extraClassPath to augment the executor classpath

17/07/17 16:04:01 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to '/opt/hail/build/libs/hail-all-spark.jar' as a work-around.
17/07/17 16:04:01 WARN spark.SparkConf: Setting 'spark.driver.extraClassPath' to '/opt/hail/build/libs/hail-all-spark.jar' as a work-around.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera1
      /_/

Using Python version 2.7.13 (default, Jul 14 2017 11:59:49)
SparkSession available as 'spark'.

from hail import *
hc = HailContext()
Traceback (most recent call last):
File "", line 1, in
File "", line 2, in __init__
File "/opt/hail/python/", line 202, in _typecheck
File "/opt/hail/python/", line 83, in __init__
File "/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/", line 1133, in __call__
File "/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/pyspark/sql/", line 63, in deco
return f(*a, **kw)
File "/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o57.apply.
: org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$2.apply(SparkContext.scala:2278)
at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$2.apply(SparkContext.scala:2274)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2274)
at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2360)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:85)
at is.hail.HailContext$.configureAndCreateSparkContext(HailContext.scala:88)
at is.hail.HailContext$.apply(HailContext.scala:163)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at py4j.reflection.MethodInvoker.invoke(
at py4j.reflection.ReflectionEngine.invoke(
at py4j.Gateway.invoke(
at py4j.commands.AbstractCommand.invokeMethod(
at py4j.commands.CallCommand.execute(

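The pyspark shell has already created a SparkContext (available as sc), so HailContext() fails when it tries to create a second one. In this case, construct the HailContext by passing in the existing SparkContext: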

hc = HailContext(sc)

Ok. That worked. Thank you.