Potential compatibility issue with Spark 2.2.0?

yong · January 30, 2018, 4:55pm

Dear Hail Team!

Came across this issue and thought to bring it to you guys for help. I built the package from Git source (ver=0.1). Then I tested it on your Tutorial > Overview (https://hail.is/docs/stable/tutorials/hail-overview.html). Everything works fine until it gets to

df = vds.samples_table().to_pandas()

I haven’t got chance to dig deep into this but thought to run it by you first.
Any insights would be great.

Here is the Error Traceback:

AttributeError Traceback (most recent call last)
in ()
----> 1 df = vds.samples_table().to_pandas()
in to_pandas(self, expand, flatten)
/databricks/spark/python/lib/hail-python.zip/hail/java.py in handle_py4j(func, *args, **kwargs)
113 def handle_py4j(func, *args, **kwargs):
114 try:
–> 115 r = func(*args, **kwargs)
116 except py4j.protocol.Py4JJavaError as e:
117 tpl = Env.jutils().handleForPython(e.java_exception)
in to_pandas(self, expand, flatten)
/databricks/spark/python/lib/hail-python.zip/hail/typecheck/check.py in _typecheck(f, *args, **kwargs)
243 def _typecheck(f, *args, **kwargs):
244 check_all(f, args, kwargs, checkers, is_method=True)
–> 245 return f(*args, **kwargs)
246
247 return decorator(_typecheck)
/databricks/spark/python/lib/hail-python.zip/hail/keytable.py in to_pandas(self, expand, flatten)
642 “”"
643
–> 644 return self.to_dataframe(expand, flatten).toPandas()
645
646 @handle_py4j
/databricks/spark/python/pyspark/sql/dataframe.pyc in toPandas(self)
1705 “”“
1706 import pandas as pd
-> 1707 if self.sql_ctx.getConf(“spark.sql.execution.arrow.enable”, “false”).lower() == “true”:
1708 try:
1709 import pyarrow
/databricks/spark/python/pyspark/sql/context.pyc in getConf(self, key, defaultValue)
139 u’50’
140 “””
–> 141 return self.sparkSession.conf.get(key, defaultValue)
142
143 @property
AttributeError: ‘JavaMember’ object has no attribute ‘get’

tpoterba · January 30, 2018, 10:27pm

We really haven’t tested Hail 0.1 against Spark 2.2.0, and have seen problems in the past. Hail 0.2, which will be in beta in several weeks, will support 2.2.0 as the default version I think.

MPGN · March 2, 2018, 5:15pm

Spark 2.3.0 was just released on Feb 28th and comes with numerous bug fixes and improvements to SparkSQL and PySpark executions, as well as support for the ‘zstd’ compression method which offers ~38% improvement over the snappy default for Parquet files. Would it be possible to consider making Spark 2.3.0 the default in the next Hail version? We plan on using Spark 2.3 at my company for other pipelines, and keeping everything on the same version just makes managing software easier

Best,
Matt

tpoterba · March 2, 2018, 5:30pm

I think we’re pretty committed to 2.2.0 as the default version for the 0.2 beta version, and probably won’t start deploying 2.3 jars until Google Dataproc starts offering a 2.3 image.

We’re also diverging more and more from SparkSQL, and don’t use DataFrames to write parquet anymore due to poor performance going between the DataFrame and RDD APIs.

That said, I imagine Hail would work with 2.3 out of the box, though you’ll need to compile it yourself.

tpoterba · March 2, 2018, 5:34pm

Also, if you’re getting started with Hail now, I’d definitely recommend using the 0.2 development version!

mdghouse1986 · April 4, 2018, 5:01am

Hi,

“We’re also diverging more and more from SparkSQL, and don’t use DataFrames to write parquet anymore due to poor performance going between the DataFrame and RDD APIs.” ----- diverging towards Datasets or to RDD APIs? Because Spark recommends to use Datasets and Dataframes for any future development.

tpoterba · April 4, 2018, 12:05pm

Hail 0.2 is built entirely on the RDD API. 0.1 used RDD everywhere except to write/read parquet, and this caused huge problems with memory management and performance.

Dataframes don’t work for some kinds of computational tasks Hail needs to support – namely, extremely wide tables (sometimes up to 1000 fields) and matrix data.

mdghouse1986 · April 4, 2018, 10:03pm

So , 0.2 is still using RDDs as base abstraction?

tpoterba · April 4, 2018, 10:06pm

Yep! Really no choice, for the problems Hail is designed to solve.

ines · August 24, 2018, 5:40pm

Hi Tim,

now that a google dataproc image exists for spark 2.3, how to compile Hail 0.2 with this version of spark ?

Thanks

Ines

tpoterba · August 24, 2018, 5:45pm

see here:

Topic		Replies	Views
Hail 0.2.27 not running on Cloudera spark 2.4.0 Hail Query & hailctl	8	675	November 19, 2019
Using Hail on Spark 2.1.1 Azure HDInsight causes error Help [0.1]	16	2035	April 6, 2018
Spark version problem Hail Query & hailctl	3	910	April 18, 2019
AttributeError: 'DataFrame' object has no attribute 'to_spark' Help [0.1]	19	7512	August 1, 2018
Install Hail using Spark Hail Query & hailctl	15	1426	April 13, 2018

Potential compatibility issue with Spark 2.2.0?

Related topics