I want to run a spark shell with the Hail JAR on Google Dataproc, but I get errors

Hi there,

Yeah, this is pretty confusing. spark-shell --help says:

  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.

But, apparently, --jars alone does not actually put those local JARs on the driver and executor class paths. You have to add the JARs explicitly with the extraClassPath properties:

spark-shell --jars './hail.jar' \
  --conf='spark.sql.files.openCostInBytes=53687091200' \
  --conf='spark.sql.files.maxPartitionBytes=53687091200' \
  --conf='spark.driver.extraClassPath=./hail.jar' \
  --conf='spark.executor.extraClassPath=./hail.jar'
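
If you're going to do this repeatedly, you can also bake the same settings into Spark's spark-defaults.conf instead of passing flags every time. A minimal sketch, assuming the JAR sits at /path/to/hail.jar (placeholder path) and that your Dataproc image keeps Spark's config in the usual /etc/spark/conf:

# /etc/spark/conf/spark-defaults.conf (path is an assumption about your image)
spark.jars                     /path/to/hail.jar
spark.driver.extraClassPath    /path/to/hail.jar
spark.executor.extraClassPath  ./hail.jar

Here spark.jars is the config-file equivalent of the --jars flag.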

NB: the JARs are copied to the working directory of the executors, but not to the working directory of the driver. Usually, spark.driver.extraClassPath will be the same path you passed to --jars, whereas spark.executor.extraClassPath must be a relative path.
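
For example, if the JAR lives somewhere other than your current directory, the two properties diverge. A sketch, assuming the JAR is at /home/you/hail.jar (hypothetical path):

spark-shell --jars '/home/you/hail.jar' \
  --conf='spark.driver.extraClassPath=/home/you/hail.jar' \
  --conf='spark.executor.extraClassPath=./hail.jar'

The driver reads the JAR from where it already sits, while each executor gets a copy in its own working directory, hence the relative ./hail.jar.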


When the JARs are missing from the class paths, the error sometimes manifests as:

ClassNotFoundException: is.hail.utils.SerializableHadoopConfiguration
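
If you see that even with the flags above, a quick sanity check is to confirm the class really is inside the JAR you're pointing at (adjust the path to your JAR):

jar tf ./hail.jar | grep SerializableHadoopConfiguration

If nothing prints, the JAR itself is missing the class, and no amount of class path configuration will help.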