Issuing Spark-SQL on Hail Tables

zzi · March 18, 2019, 3:58pm

Hi all,

I'm using Hail 0.2 (version 0.2.10-a4870bf102a8) with Spark 2.3.0:


Running on Apache Spark version 2.3.0.cloudera4
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.10-a4870bf102a8

I am struggling to run Spark SQL on Hail data that I have converted to a Spark DataFrame, e.g.:

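import hail as hl
from pyspark.sql import SQLContext

sqc = SQLContext(sc)  # sc: the active SparkContext
final_vds = hl.import_vcf(vcf)  # vcf: path to the input VCF
df = final_vds.rows().to_spark(flatten=True)  # rows table as a flattened Spark DataFrame
df.createOrReplaceTempView("mytable")
dss = sqc.sql("select * from mytable")  # fails: view/table not found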

This does not work: Spark complains that the view/table is not available. The same approach works with an ordinary Spark DataFrame produced by a Spark ingest workflow, but not with one that comes from Hail.

Should I save this to HDFS, read it back in as a Spark DataFrame, and then run SQL, or is there a better way?

Thanks

tpoterba · March 18, 2019, 4:14pm

We don’t really know how Spark-SQL works. It’s confusing to me that the above code doesn’t do what you want.
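
One possible explanation, which is an assumption and not something confirmed in this thread, is that temporary views are scoped to the Spark session that owns the DataFrame, and a fresh SQLContext(sc) may end up attached to a different session than the one Hail uses. A minimal sketch that sidesteps this by querying through the DataFrame's own context follows. The helper name query_temp_view is hypothetical; df.sql_ctx is the PySpark DataFrame attribute available in the Spark 2.x line used here.

```python
def query_temp_view(df, view_name, sql_text):
    """Register df as a temp view and query it via df's own SQL context.

    Assumption: the view lookup failed because the query ran in a
    different Spark session than the one that owns df.
    """
    # Temp views are visible only in the session that registered them.
    df.createOrReplaceTempView(view_name)
    # df.sql_ctx is the SQLContext tied to the session that created df.
    return df.sql_ctx.sql(sql_text)


# Usage with the objects from the question (not executed here):
# df = final_vds.rows().to_spark(flatten=True)
# dss = query_temp_view(df, "mytable", "select * from mytable")
```

If this works where sqc.sql(...) did not, the session mismatch is the likely culprit.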

Note that converting between Hail and Spark is really expensive, though. If you intend to run both Hail and Spark queries, you are probably better off storing a copy as a Hail table and a copy as a Spark DataFrame.
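
As a rough sketch of the two-copy approach, assuming a Hail Table ht and two output paths of your choosing (the helper name save_both_copies and both path parameters are hypothetical), the conversion is paid once and each engine then reads its own format:

```python
def save_both_copies(ht, hail_path, parquet_path):
    """Persist one Hail-native copy and one Spark-readable Parquet copy.

    ht is assumed to be a Hail Table, e.g. final_vds.rows().
    """
    # Hail-native copy, for later hl.read_table(hail_path).
    ht.write(hail_path, overwrite=True)
    # One-time conversion to a flattened Spark DataFrame.
    df = ht.to_spark(flatten=True)
    # Parquet copy, for later spark.read.parquet(parquet_path).
    df.write.mode("overwrite").parquet(parquet_path)
    return df


# Usage (paths are placeholders, not executed here):
# save_both_copies(final_vds.rows(), "rows.ht", "rows.parquet")
```

Spark SQL queries can then run against the Parquet copy without going through Hail again.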
