Java error when loading VCF

I am trying to load a 1000 Genomes VCF and I get the error below. It seems like it is trying to use Spark 1? Maybe? Just guessing.

Hail version: 0.1-f57188f
Error summary: ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

What version of Spark is installed on your cluster? Did you supply -Dspark.version when compiling Hail? If not, Hail is compiled for Spark 2.0.2. Hail requires a Spark cluster of version at least 2.0.2. If you're using a version of Spark other than 2.0.2 (or a Cloudera Spark), you must specify it at compile time, for example:

./gradlew -Dspark.version=2.1.1 clean shadowJar

For Cloudera clusters, you need something like:

./gradlew -Dspark.version=2.0.2.cloudera clean shadowJar

Also, if you have multiple Spark installs, ensure you're using spark2-submit/pyspark2, not spark-submit/pyspark.
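
If you're not sure which Spark you ended up on, a quick sanity check (just a sketch, run from inside the pyspark/pyspark2 shell, where sc already exists) is to print the running version and compare it against the -Dspark.version you compiled Hail with:

# Inside the pyspark2 shell; `sc` is the SparkContext the shell creates for you.
print(sc.version)   # should match the -Dspark.version used for the shadowJar build
print(sc.master)    # confirms which cluster manager / deploy mode you are actually using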

Let me know if none of this fixes your issue!

@danking I think you added a check for that in HailContext construction.

I vaguely remember seeing this error when the jar is visible on the driver node, but not the executors.

I am trying to run this from an edge node that only has the gateway installed.

Can you check if the file in SPARK_CLASSPATH can be ls-ed from the worker machines? I suspect it’s not in a network-visible file system, which has caused this error in the past.
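
If that's easier than logging into the workers, here is a rough sketch you could run from inside the pyspark2 shell: it asks the executors themselves whether the jar path is visible on their local filesystems. The path below is a placeholder; use whatever your SPARK_CLASSPATH actually points at.

import os

jar_path = '/path/to/hail-all-spark.jar'  # placeholder; use the path from your SPARK_CLASSPATH
checks = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
            .map(lambda _: os.path.exists(jar_path))
            .collect())
print('jar visible from all executors: %s' % all(checks))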

I’ve seen this happen before when running python instead of pyspark/pyspark2.

I clearly have something mis-configured. I guess I don’t understand what is required as a config for the Spark worker nodes.

When I run pyspark2 like this, I get this response:
[kmlong@lsa12-dn0 ~]$ pyspark2 --jars $HAIL_HOME/build/libs/hail-all-spark.jar --py-files $HAIL_HOME/python/hail-python.zip --conf spark.hadoop.io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,is.hail.io.compress.BGzipCodec,org.apache.hadoop.io.compress.GzipCodec --conf spark.sql.files.openCostInBytes=1099511627776 --conf spark.sql.files.maxPartitionBytes=1099511627776 --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=1099511627776 --conf spark.hadoop.parquet.block.size=1099511627776
WARNING: User-defined SPARK_HOME (/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2) overrides detected (/opt/cloudera/parcels/SPARK2/lib/spark2).
WARNING: Running pyspark from user-defined location.
Python 2.7.13 (default, Jul 14 2017, 11:59:49)
[GCC 5.3.1 20160406 (Red Hat 5.3.1-6)] on linux2
Type “help”, “copyright”, “credits” or “license” for more information.
Setting default log level to “WARN”.
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/07/17 16:04:01 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to ‘/opt/hail/build/libs/hail-all-spark.jar’).
This is deprecated in Spark 1.0+.

Please instead use:

  • ./spark-submit with --driver-class-path to augment the driver classpath
  • spark.executor.extraClassPath to augment the executor classpath

17/07/17 16:04:01 WARN spark.SparkConf: Setting ‘spark.executor.extraClassPath’ to ‘/opt/hail/build/libs/hail-all-spark.jar’ as a work-around.
17/07/17 16:04:01 WARN spark.SparkConf: Setting ‘spark.driver.extraClassPath’ to ‘/opt/hail/build/libs/hail-all-spark.jar’ as a work-around.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0.cloudera1
      /_/

Using Python version 2.7.13 (default, Jul 14 2017 11:59:49)
SparkSession available as ‘spark’.

>>> from hail import *
>>> hc = HailContext()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "", line 2, in __init__
File "/opt/hail/python/hail-python.zip/hail/typecheck/check.py", line 202, in _typecheck
File "/opt/hail/python/hail-python.zip/hail/context.py", line 83, in __init__
File "/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/log/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o57.apply.
: org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$2.apply(SparkContext.scala:2278)
at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$2.apply(SparkContext.scala:2274)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2274)
at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2360)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:85)
at is.hail.HailContext$.configureAndCreateSparkContext(HailContext.scala:88)
at is.hail.HailContext$.apply(HailContext.scala:163)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

In this case, construct the HailContext by passing in the existing SparkContext:

hc = HailContext(sc)
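
For your original use case, the whole session would then look something like this (the VCF path below is just a placeholder for your 1000 Genomes file):

from hail import *
hc = HailContext(sc)  # reuse the SparkContext that pyspark2 already created
vds = hc.import_vcf('/path/to/1000genomes.vcf.bgz')  # placeholder path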

Ok. That worked. Thank you.