Hi!
We are experiencing issues with several pipelines that rely on Hail. We run them on a Dataproc cluster in Google Cloud Platform, and after an update to the Dataproc images we keep running into what looks like a dependency conflict involving Janino. Specifically:
FatalError: ClassNotFoundException: org.codehaus.janino.InternalCompilerException
These pipelines worked before the version change. There is more context in this GitHub issue.
To reproduce
- We start a Dataproc cluster and set specific properties to configure Hail. This is our custom Spark config for Hail support (a sketch of the full job submission follows the repro steps):
job.pyspark_job.properties = {
    "spark.jars": "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar",
    "spark.driver.extraClassPath": "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar",
    "spark.executor.extraClassPath": "./hail-all-spark.jar",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "is.hail.kryo.HailKryoRegistrator",
}
- I connect to the cluster's Jupyter notebook server and try to process gnomAD's variant information:
import hail as hl

hl.init()
ht = hl.read_table(
    "gs://gcp-public-data--gnomad/release/3.1.2/ht/genomes/gnomad.genomes.v3.1.2.sites.ht",
    _load_refs=False,
)
- The error appears when I convert the Hail table to a Spark DataFrame:
ht.select_globals().head(2).to_spark(flatten=False)
>>> ...
>>> Error summary: ClassNotFoundException: org.codehaus.janino.InternalCompilerException
Full traceback
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
/tmp/ipykernel_1112676/2283394357.py in <cell line: 1>()
----> 1 ht.head(2).select_globals().to_spark()
<decorator-gen-1382> in to_spark(self, flatten)
/opt/conda/miniconda3/lib/python3.10/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    585     def wrapper(__original_func: Callable[..., T], *args, **kwargs) -> T:
    586         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 587         return __original_func(*args_, **kwargs_)
    588 
    589     return wrapper
/opt/conda/miniconda3/lib/python3.10/site-packages/hail/table.py in to_spark(self, flatten)
   3511 
   3512         """
-> 3513         return Env.spark_backend('to_spark').to_spark(self, flatten)
   3514 
   3515     @typecheck_method(flatten=bool, types=dictof(oneof(str, hail_type), str))
/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/spark_backend.py in to_spark(self, t, flatten)
    297         if flatten:
    298             t = t.flatten()
--> 299         return pyspark.sql.DataFrame(self._jbackend.pyToDF(self._to_java_table_ir(t._tir)), self._spark_session)
    300 
    301     def register_ir_function(self, name, type_parameters, argument_names, argument_types, return_type, body):
/opt/conda/miniconda3/lib/python3.10/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1319 
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)
   1323 
/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     33             tpl = Env.jutils().handleForPython(e.java_exception)
     34             deepest, full, error_id = tpl._1(), tpl._2(), tpl._3()
---> 35             raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
     36         except pyspark.sql.utils.CapturedException as e:
     37             raise FatalError('%s\n\nJava stack trace:\n%s\n'
FatalError: ClassNotFoundException: org.codehaus.janino.InternalCompilerException
Java stack trace:
java.lang.NoClassDefFoundError: org/codehaus/janino/InternalCompilerException
	at org.apache.spark.sql.catalyst.expressions.objects.GetExternalRowField.<init>(objects.scala:1841)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:195)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
	at org.apache.spark.sql.SparkSession.$anonfun$createDataFrame$3(SparkSession.scala:361)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:357)
	at is.hail.expr.ir.TableValue.toDF(TableValue.scala:149)
	at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$2(SparkBackend.scala:590)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:76)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:76)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
	at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$1(SparkBackend.scala:345)
	at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$1(SparkBackend.scala:589)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.pyToDF(SparkBackend.scala:588)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
java.lang.ClassNotFoundException: org.codehaus.janino.InternalCompilerException
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
	at org.apache.spark.sql.catalyst.expressions.objects.GetExternalRowField.<init>(objects.scala:1841)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:195)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
	at org.apache.spark.sql.SparkSession.$anonfun$createDataFrame$3(SparkSession.scala:361)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:357)
	at is.hail.expr.ir.TableValue.toDF(TableValue.scala:149)
	at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$2(SparkBackend.scala:590)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:76)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:76)
	at is.hail.utils.package$.using(package.scala:637)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
	at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$1(SparkBackend.scala:345)
	at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$1(SparkBackend.scala:589)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.pyToDF(SparkBackend.scala:588)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Hail version: 0.2.124-13536b531342
Error summary: ClassNotFoundException: org.codehaus.janino.InternalCompilerException
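For anyone trying to reproduce this without pulling the gnomAD table: any to_spark() call should go through the same RowEncoder code path, so I would expect a tiny table to trigger the same error (a sketch; I haven't verified this exact snippet):

import hail as hl

hl.init()

# Any to_spark() call exercises Spark's RowEncoder, which is where the
# missing Janino class is first referenced, so this minimal table should
# hit the same ClassNotFoundException (assumption, not verified).
ht_small = hl.utils.range_table(10)
df = ht_small.to_spark()
df.show()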
Are you aware of this compatibility issue? My gut feeling is that our Spark configuration for Hail is conflicting with the Janino classes that Spark's code generation expects on the classpath, but I haven't been able to confirm that. Any help is greatly appreciated.
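One check I can run from the same notebook, in case it helps narrow this down, is asking the driver JVM directly whether it can see the Janino class and, if so, which jar it was loaded from (a sketch using py4j internals; sc._jvm is not public API):

import hail as hl

sc = hl.spark_context()  # the SparkContext behind Hail's Spark backend
jvm = sc._jvm  # py4j gateway into the driver JVM (private API, debugging only)
try:
    cls = jvm.java.lang.Class.forName("org.codehaus.janino.InternalCompilerException")
    # Print the jar the class was loaded from.
    print(cls.getProtectionDomain().getCodeSource().getLocation().toString())
except Exception as e:
    print("Janino class not visible to the driver JVM:", e)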
Thanks!
Irene