Potential compatibility issue with Spark 2.2.0?

Dear Hail Team!

I came across this issue and thought I'd bring it to you for help. I built the package from the Git source (ver=0.1) and then tested it against your Tutorial > Overview (https://hail.is/docs/stable/tutorials/hail-overview.html). Everything works fine until it gets to

df = vds.samples_table().to_pandas()

I haven't had a chance to dig deeply into this yet, but thought I'd run it by you first.
Any insights would be great.

Here is the Error Traceback:


AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 df = vds.samples_table().to_pandas()

<decorator-gen> in to_pandas(self, expand, flatten)

/databricks/spark/python/lib/hail-python.zip/hail/java.py in handle_py4j(func, *args, **kwargs)
    113 def handle_py4j(func, *args, **kwargs):
    114     try:
--> 115         r = func(*args, **kwargs)
    116     except py4j.protocol.Py4JJavaError as e:
    117         tpl = Env.jutils().handleForPython(e.java_exception)

<decorator-gen> in to_pandas(self, expand, flatten)

/databricks/spark/python/lib/hail-python.zip/hail/typecheck/check.py in _typecheck(f, *args, **kwargs)
    243     def _typecheck(f, *args, **kwargs):
    244         check_all(f, args, kwargs, checkers, is_method=True)
--> 245         return f(*args, **kwargs)
    246
    247     return decorator(_typecheck)

/databricks/spark/python/lib/hail-python.zip/hail/keytable.py in to_pandas(self, expand, flatten)
    642         """
    643
--> 644         return self.to_dataframe(expand, flatten).toPandas()
    645
    646     @handle_py4j

/databricks/spark/python/pyspark/sql/dataframe.pyc in toPandas(self)
   1705         """
   1706         import pandas as pd
-> 1707         if self.sql_ctx.getConf("spark.sql.execution.arrow.enable", "false").lower() == "true":
   1708             try:
   1709                 import pyarrow

/databricks/spark/python/pyspark/sql/context.pyc in getConf(self, key, defaultValue)
    139         u'50'
    140         """
--> 141         return self.sparkSession.conf.get(key, defaultValue)
    142
    143     @property

AttributeError: 'JavaMember' object has no attribute 'get'
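
Reading the traceback, the failure happens inside DataFrame.toPandas()'s Arrow config lookup rather than in Hail itself, so as a stopgap I may try collecting through the Spark DataFrame API and building the pandas frame by hand. A rough, untested sketch (it assumes the tutorial's vds and that KeyTable.to_dataframe() works, which the traceback suggests it does):

    import pandas as pd

    # Untested workaround sketch: avoid DataFrame.toPandas(), which trips on the
    # Arrow config lookup, by collecting Row objects and converting them manually.
    kt = vds.samples_table()              # Hail 0.1 KeyTable
    sdf = kt.to_dataframe()               # Spark DataFrame (same call seen in the traceback)
    rows = sdf.collect()                  # list of pyspark.sql.Row
    df = pd.DataFrame.from_records([r.asDict() for r in rows], columns=sdf.columns)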

We really haven't tested Hail 0.1 against Spark 2.2.0, and we've seen problems with untested Spark versions in the past. Hail 0.2, which should be in beta in a few weeks, will support 2.2.0 as the default version, I think.

Spark 2.3.0 was just released on Feb 28th and comes with numerous bug fixes and improvements to Spark SQL and PySpark execution, as well as support for the 'zstd' compression codec, which offers roughly a 38% improvement over the default snappy codec for Parquet files. Would it be possible to consider making Spark 2.3.0 the default in the next Hail version? We plan on using Spark 2.3 at my company for other pipelines, and keeping everything on the same version just makes managing software easier :slight_smile:
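
For what it's worth, turning it on at the writer side is just a config switch. A sketch of what I have in mind; note the "zstd" value is an assumption that the cluster's Parquet/Hadoop libraries actually ship that codec, so it's worth verifying before relying on it:

    from pyspark.sql import SparkSession

    # Sketch: request zstd-compressed Parquet output (fall back to "snappy" if the
    # running Spark/Hadoop build does not include a zstd codec).
    spark = (SparkSession.builder
             .appName("parquet-zstd-example")
             .config("spark.sql.parquet.compression.codec", "zstd")
             .getOrCreate())

    spark.range(1000).write.mode("overwrite").parquet("/tmp/zstd_example.parquet")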

Best,
Matt

I think we're pretty committed to 2.2.0 as the default Spark version for the 0.2 beta, and we probably won't start deploying 2.3 jars until Google Dataproc starts offering a 2.3 image.

We’re also diverging more and more from SparkSQL, and don’t use DataFrames to write parquet anymore due to poor performance going between the DataFrame and RDD APIs.
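
(Rough illustration only, not Hail's actual code: assuming a SparkSession named spark and some DataFrame df, the round trip below decodes every Catalyst row into a Python Row object and re-encodes it on the way back, which is where the time goes.)

    # Illustrative sketch of the DataFrame/RDD boundary cost.
    rdd = df.rdd                                        # DataFrame -> RDD[Row]: rows decoded from Catalyst's format
    processed = rdd.map(lambda r: r)                    # arbitrary RDD-side work
    df2 = spark.createDataFrame(processed, df.schema)   # RDD -> DataFrame: rows re-encoded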

That said, I imagine Hail would work with 2.3 out of the box, though you’ll need to compile it yourself.

Also, if you’re getting started with Hail now, I’d definitely recommend using the 0.2 development version!

Hi,

"We're also diverging more and more from SparkSQL, and don't use DataFrames to write parquet anymore due to poor performance going between the DataFrame and RDD APIs." ----- Are you diverging towards Datasets or towards the RDD APIs? I ask because Spark recommends using Datasets and DataFrames for any future development.

Hail 0.2 is built entirely on the RDD API. 0.1 used RDDs everywhere except to write/read Parquet, and that caused huge problems with memory management and performance.

DataFrames don't work for some kinds of computational tasks Hail needs to support, namely extremely wide tables (sometimes up to 1000 fields) and matrix data.

So 0.2 is still using RDDs as the base abstraction?

Yep! There's really no choice, given the problems Hail is designed to solve.

Hi Tim,

Now that a Google Dataproc image exists for Spark 2.3, how do I compile Hail 0.2 against this version of Spark?

Thanks

Ines

see here: