Incompatibility between Hail and Spark 3.3.2

Hi!

We are experiencing issues with several pipelines that rely on Hail. We run them on a Dataproc cluster in Google Cloud Platform, and after an update to the Dataproc images we keep running into a classpath error involving Janino and the rest of the dependencies. Specifically:
FatalError: ClassNotFoundException: org.codehaus.janino.InternalCompilerException.

These pipelines worked before the version change. More context is available in this GitHub issue.

To reproduce

  1. We start a Dataproc cluster and define specific parameters to configure Hail. This is our custom Spark config for Hail support (the full job submission is sketched after the steps below):
    job.pyspark_job.properties = {
        "spark.jars": "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar",
        "spark.driver.extraClassPath": "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar",
        "spark.executor.extraClassPath": "./hail-all-spark.jar",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryo.registrator": "is.hail.kryo.HailKryoRegistrator",
    }
  2. I connect to the cluster's Jupyter notebook server and try to process gnomAD's variant information:
hl.init()
ht = hl.read_table(
    "gs://gcp-public-data--gnomad/release/3.1.2/ht/genomes/gnomad.genomes.v3.1.2.sites.ht",
    _load_refs=False,
)
  3. The error appears when I convert the Hail table to a Spark DataFrame:
ht.select_globals().head(2).to_spark(flatten=False)
>>> ...
>>> Error summary: ClassNotFoundException: org.codehaus.janino.InternalCompilerException
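
For completeness, this is roughly how we submit the job with the Dataproc Python client (a sketch; the project, region, cluster name, and script URI are placeholders for our real values):

from google.cloud import dataproc_v1

REGION = "europe-west1"  # placeholder
HAIL_JAR = "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar"

# The job controller client needs a regional endpoint.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "our-hail-cluster"},  # placeholder
    "pyspark_job": {
        "main_python_file_uri": "gs://our-bucket/pipeline.py",  # placeholder
        "properties": {
            "spark.jars": HAIL_JAR,
            "spark.driver.extraClassPath": HAIL_JAR,
            "spark.executor.extraClassPath": "./hail-all-spark.jar",
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            "spark.kryo.registrator": "is.hail.kryo.HailKryoRegistrator",
        },
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": "our-project", "region": REGION, "job": job}  # placeholders
)
operation.result()  # block until the job finishes
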
Full traceback
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
/tmp/ipykernel_1112676/2283394357.py in <cell line: 1>()
----> 1 ht.head(2).select_globals().to_spark()

<decorator-gen-1382> in to_spark(self, flatten)

/opt/conda/miniconda3/lib/python3.10/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    585     def wrapper(__original_func: Callable[..., T], *args, **kwargs) -> T:
    586         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 587         return __original_func(*args_, **kwargs_)
    588 
    589     return wrapper

/opt/conda/miniconda3/lib/python3.10/site-packages/hail/table.py in to_spark(self, flatten)
   3511 
   3512         """
-> 3513         return Env.spark_backend('to_spark').to_spark(self, flatten)
   3514 
   3515     @typecheck_method(flatten=bool, types=dictof(oneof(str, hail_type), str))

/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/spark_backend.py in to_spark(self, t, flatten)
    297         if flatten:
    298             t = t.flatten()
--> 299         return pyspark.sql.DataFrame(self._jbackend.pyToDF(self._to_java_table_ir(t._tir)), self._spark_session)
    300 
    301     def register_ir_function(self, name, type_parameters, argument_names, argument_types, return_type, body):

/opt/conda/miniconda3/lib/python3.10/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1319 
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)
   1323 

/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     33             tpl = Env.jutils().handleForPython(e.java_exception)
     34             deepest, full, error_id = tpl._1(), tpl._2(), tpl._3()
---> 35             raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
     36         except pyspark.sql.utils.CapturedException as e:
     37             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: ClassNotFoundException: org.codehaus.janino.InternalCompilerException

Java stack trace:
java.lang.NoClassDefFoundError: org/codehaus/janino/InternalCompilerException
	at org.apache.spark.sql.catalyst.expressions.objects.GetExternalRowField.<init>(objects.scala:1841)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:195)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
	at org.apache.spark.sql.SparkSession.$anonfun$createDataFrame$3(SparkSession.scala:361)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:357)
	at is.hail.expr.ir.TableValue.toDF(TableValue.scala:149)
	at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$2(SparkBackend.scala:590)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:76)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:76)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
	at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$1(SparkBackend.scala:345)
	at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$1(SparkBackend.scala:589)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.pyToDF(SparkBackend.scala:588)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)

java.lang.ClassNotFoundException: org.codehaus.janino.InternalCompilerException
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
	at org.apache.spark.sql.catalyst.expressions.objects.GetExternalRowField.<init>(objects.scala:1841)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:195)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
	at org.apache.spark.sql.SparkSession.$anonfun$createDataFrame$3(SparkSession.scala:361)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:357)
	at is.hail.expr.ir.TableValue.toDF(TableValue.scala:149)
	at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$2(SparkBackend.scala:590)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:76)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:76)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
	at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$1(SparkBackend.scala:345)
	at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$1(SparkBackend.scala:589)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.pyToDF(SparkBackend.scala:588)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)

Hail version: 0.2.124-13536b531342
Error summary: ClassNotFoundException: org.codehaus.janino.InternalCompilerException

Are you aware of this compatibility issue? My gut feeling is that our Spark configuration for Hail is causing a classpath conflict. Any help is greatly appreciated.
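
For reference, this is how we confirmed the versions from within the notebook (a quick check; the expected values in the comments reflect our setup):

import hail as hl
import pyspark

print(hl.version())         # expect 0.2.124-13536b531342, matching the traceback above
print(pyspark.__version__)  # expect 3.3.2 on the updated Dataproc image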
Thanks!
Irene

Hi, are you able to try downgrading Hail to version 0.2.122? We had a bug, introduced in 0.2.123, that broke Dataproc notebooks (but not scripts submitted directly to Dataproc). It should be fixed in 0.2.125, but that release is still in progress.

The bug caused any Spark jobs created by Hail to stall and never complete, and the fix was reverting some changes to our Gradle configuration that had pulled a number of Spark classes into the Hail jar. While we didn't encounter the exact error message you're running into, I think it may have the same root cause.

If that doesn't fix the issue, please let us know so we can take a closer look. Thanks!
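
If you'd like to double-check whether your installed jar is one of the affected builds, a quick look at its contents should show whether Spark's own classes were bundled in (a sketch; the jar path is taken from the config in your first post):

import zipfile

jar = "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar"
with zipfile.ZipFile(jar) as zf:
    names = zf.namelist()

# Count class files that belong to Spark itself; a clean Hail build should not
# bundle these, since they can shadow the versions shipped with Spark.
spark_classes = [n for n in names if n.startswith("org/apache/spark/")]
print(len(spark_classes), "Spark class files bundled in the Hail jar")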

Downgrading to 0.2.122 solved the problem, @iris-garden. Thank you so much!
We'll test again on 0.2.125 once it is out.
Best,
Irene