Issuing Spark-SQL on Hail Tables

#1

Hi all,

I'm using Hail 0.2.10-a4870bf102a8 with Spark 2.3.0:


Running on Apache Spark version 2.3.0.cloudera4
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.10-a4870bf102a8

I am struggling to run Spark SQL on Hail data that I have converted to a Spark DataFrame, e.g.:

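import hail as hl
from pyspark.sql import SQLContext

sqc = SQLContext(sc)  # sc: the existing SparkContext
final_vds = hl.import_vcf(vcf)  # vcf: path to the input VCF
df = final_vds.rows().to_spark(flatten=True)
df.createOrReplaceTempView("mytable")
dss = sqc.sql("select * from mytable")  # fails: table or view not found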

This does not work: Spark complains that the table or view is not available. The same approach works with an ordinary Spark DataFrame produced by a Spark ingest workflow, but not with one produced from Hail.

Should I save the data back to HDFS, read it in again as a Spark DataFrame, and then run the SQL, or is there a better way?
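
One possible explanation, which this thread does not confirm, is that temporary views are scoped to a Spark session, and the DataFrame returned by to_spark may belong to a different session than the one behind a freshly constructed SQLContext(sc). A minimal sketch under that assumption runs the query through the context the DataFrame itself carries. The helper name sql_on_own_context is hypothetical; df.sql_ctx is the SQLContext attribute that PySpark 2.x attaches to each DataFrame.

```python
def sql_on_own_context(df, view_name, query):
    """Register df as a temp view and query it through df's own SQLContext."""
    # The view is registered in the session that created df.
    df.createOrReplaceTempView(view_name)
    # df.sql_ctx belongs to that same session, so the view is visible to it.
    return df.sql_ctx.sql(query)


# Usage with the DataFrame from the question (assumes df already exists):
# dss = sql_on_own_context(df, "mytable", "select * from mytable")
```

If this works where sqc.sql does not, a session mismatch is the likely cause; if it fails the same way, the problem lies elsewhere.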

Thanks


#2

We don’t really know how Spark-SQL works. It’s confusing to me that the above code doesn’t do what you want.

Note that converting between Hail and Spark is really expensive, though. If you intend to run both Hail and Spark queries, you are probably better off storing a copy as a Hail table and a copy as a Spark DataFrame.
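
The two-copy approach could look like the sketch below: write the table once in Hail's native format and once for Spark, so later Hail and Spark sessions each read their own copy without converting again. The function name and both path parameters are hypothetical, and Parquet is an assumed choice of Spark-side format that this reply does not specify.

```python
def save_both_copies(ht, hail_path, parquet_path):
    """Persist a Hail Table natively and as a Parquet copy for Spark."""
    # Native Hail copy; read it back later with hl.read_table(hail_path).
    ht.write(hail_path, overwrite=True)
    # One-time conversion to a Spark DataFrame with nested structs flattened.
    df = ht.to_spark(flatten=True)
    # Spark copy; read it back later with spark.read.parquet(parquet_path).
    df.write.mode("overwrite").parquet(parquet_path)


# Usage (assumes final_vds from the question and two paths of your choosing):
# save_both_copies(final_vds.rows(), hail_path, parquet_path)
```

This pays the expensive conversion once, at write time, instead of on every query.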
