Hi!
We are experiencing issues with several pipelines that rely on Hail. We run them on a Dataproc cluster on Google Cloud Platform, and since an update to the Dataproc images we keep running into what looks like a dependency conflict between Janino and the rest of the stack. Specifically:
FatalError: ClassNotFoundException: org.codehaus.janino.InternalCompilerException
These pipelines were working before the version change. There is more context in this GitHub issue.
To reproduce
- We start a Dataproc cluster and submit our jobs with a custom Spark configuration that enables Hail support (a full submission sketch follows the list):
job.pyspark_job.properties = {
    "spark.jars": "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar",
    "spark.driver.extraClassPath": "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar",
    "spark.executor.extraClassPath": "./hail-all-spark.jar",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "is.hail.kryo.HailKryoRegistrator",
}
- I connect to the cluster's Jupyter notebook server and try to process gnomAD's variant information:
import hail as hl

hl.init()
ht = hl.read_table(
    "gs://gcp-public-data--gnomad/release/3.1.2/ht/genomes/gnomad.genomes.v3.1.2.sites.ht",
    _load_refs=False,
)
- The error appears when I convert the Hail Table to a Spark DataFrame:
ht.select_globals().head(2).to_spark(flatten=False)
>>> ...
>>> Error summary: ClassNotFoundException: org.codehaus.janino.InternalCompilerException
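To complete the reproduction steps, this is roughly how we submit the job with the properties from the first step, using the google-cloud-dataproc Python client. It is a minimal sketch; the project, region, cluster, and script names are placeholders:

from google.cloud import dataproc_v1

region = "europe-west1"  # placeholder region
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "hail-cluster"},  # placeholder cluster name
    "pyspark_job": {
        "main_python_file_uri": "gs://our-bucket/pipeline.py",  # placeholder script
        "properties": {
            "spark.jars": "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar",
            "spark.driver.extraClassPath": "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar",
            "spark.executor.extraClassPath": "./hail-all-spark.jar",
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            "spark.kryo.registrator": "is.hail.kryo.HailKryoRegistrator",
        },
    },
}

operation = client.submit_job_as_operation(
    request={"project_id": "our-project", "region": region, "job": job}
)
print(operation.result().reference.job_id)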
Full traceback
---------------------------------------------------------------------------
FatalError Traceback (most recent call last)
/tmp/ipykernel_1112676/2283394357.py in <cell line: 1>()
----> 1 ht.head(2).select_globals().to_spark()
<decorator-gen-1382> in to_spark(self, flatten)
/opt/conda/miniconda3/lib/python3.10/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
585 def wrapper(__original_func: Callable[..., T], *args, **kwargs) -> T:
586 args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 587 return __original_func(*args_, **kwargs_)
588
589 return wrapper
/opt/conda/miniconda3/lib/python3.10/site-packages/hail/table.py in to_spark(self, flatten)
3511
3512 """
-> 3513 return Env.spark_backend('to_spark').to_spark(self, flatten)
3514
3515 @typecheck_method(flatten=bool, types=dictof(oneof(str, hail_type), str))
/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/spark_backend.py in to_spark(self, t, flatten)
297 if flatten:
298 t = t.flatten()
--> 299 return pyspark.sql.DataFrame(self._jbackend.pyToDF(self._to_java_table_ir(t._tir)), self._spark_session)
300
301 def register_ir_function(self, name, type_parameters, argument_names, argument_types, return_type, body):
/opt/conda/miniconda3/lib/python3.10/site-packages/py4j/java_gateway.py in __call__(self, *args)
1319
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1323
/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
33 tpl = Env.jutils().handleForPython(e.java_exception)
34 deepest, full, error_id = tpl._1(), tpl._2(), tpl._3()
---> 35 raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
36 except pyspark.sql.utils.CapturedException as e:
37 raise FatalError('%s\n\nJava stack trace:\n%s\n'
FatalError: ClassNotFoundException: org.codehaus.janino.InternalCompilerException
Java stack trace:
java.lang.NoClassDefFoundError: org/codehaus/janino/InternalCompilerException
at org.apache.spark.sql.catalyst.expressions.objects.GetExternalRowField.<init>(objects.scala:1841)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:195)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
at org.apache.spark.sql.SparkSession.$anonfun$createDataFrame$3(SparkSession.scala:361)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:357)
at is.hail.expr.ir.TableValue.toDF(TableValue.scala:149)
at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$2(SparkBackend.scala:590)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:76)
at is.hail.utils.package$.using(package.scala:637)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:76)
at is.hail.utils.package$.using(package.scala:637)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$1(SparkBackend.scala:345)
at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$1(SparkBackend.scala:589)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.pyToDF(SparkBackend.scala:588)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
java.lang.ClassNotFoundException: org.codehaus.janino.InternalCompilerException
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
at org.apache.spark.sql.catalyst.expressions.objects.GetExternalRowField.<init>(objects.scala:1841)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:195)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
at org.apache.spark.sql.SparkSession.$anonfun$createDataFrame$3(SparkSession.scala:361)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:357)
at is.hail.expr.ir.TableValue.toDF(TableValue.scala:149)
at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$2(SparkBackend.scala:590)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:76)
at is.hail.utils.package$.using(package.scala:637)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:76)
at is.hail.utils.package$.using(package.scala:637)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:62)
at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$1(SparkBackend.scala:345)
at is.hail.backend.spark.SparkBackend.$anonfun$pyToDF$1(SparkBackend.scala:589)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.pyToDF(SparkBackend.scala:588)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Hail version: 0.2.124-13536b531342
Error summary: ClassNotFoundException: org.codehaus.janino.InternalCompilerException
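If we read the trace correctly, Spark's code generation references org.codehaus.janino.InternalCompilerException, a class from the Janino 3.0 line (Janino moved it to the org.codehaus.commons.compiler package in the 3.1 series), so our guess is that two Janino versions are competing on the classpath. This is the quick check we ran from the notebook to see which Janino jars the image and the Hail uber-jar each bring in; it is a diagnostic sketch assuming the default Dataproc layout under /usr/lib/spark/jars:

import subprocess

hail_jar = "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar"

# Janino jars that the Dataproc image puts on Spark's classpath.
print(subprocess.run(
    ["bash", "-c", "ls /usr/lib/spark/jars | grep -i janino"],
    capture_output=True, text=True,
).stdout)

# Janino classes bundled inside the Hail uber-jar, if any.
print(subprocess.run(
    ["bash", "-c", f"unzip -l {hail_jar} | grep -i 'janino\\|commons-compiler'"],
    capture_output=True, text=True,
).stdout)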
Is this a known compatibility issue? My gut feeling is that our custom Spark configuration for Hail is causing a classpath conflict. Any help is greatly appreciated.
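In case it helps with triage, we also compared the Spark version the cluster reports with the PySpark version installed next to Hail, since an image update could move them apart. A quick sanity check, run after hl.init():

import pyspark
import hail as hl

# PySpark version installed alongside the Hail wheel.
print("pyspark:", pyspark.__version__)
# Hail version, matching the one in the traceback above.
print("hail:", hl.version())
# Spark version the running cluster actually reports; after an image update
# this can differ from what the Hail wheel was built against.
print("cluster spark:", hl.spark_context().version)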
Thanks!
Irene