DataBricks 13.3-LTS with Spark 3.4.1

Hi all,

I’ve been playing around with DataBricks as a potential variant store solution for our team at the University of Melbourne Centre for Cancer Research.

Project Glow has not been updated for a while, so it is out of date with the latest DataBricks runtime versions, which use later versions of Spark. This in turn means it cannot access more recent DataBricks features such as Unity Catalog.

The Spark 3.4 / Hail compatibility issue shows that Hail can easily support Spark 3.4 with a few small changes, which means we can use the latest LTS runtime for DataBricks (13.3-LTS) with Hail!

I’ve managed to build a Docker container based on the latest version of the DataBricks runtime. See my docker-who repo for more info.

I’m also looking into Unity Volumes for non-tabular data (like MatrixTables). Unfortunately, Unity Volumes are still in public preview and require a bit of file copy manipulation to work with Hail (for now).
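As a rough illustration of the kind of file copying meant here, below is a minimal Python sketch that stages a directory tree (a MatrixTable is stored as a directory) between a scratch location and a volume path. The function name `stage_directory` and every path shown are hypothetical and not taken from the docker-who repo. It also assumes a single-node cluster where both the scratch directory and the `/Volumes/...` mount are visible as ordinary local paths.

```python
import shutil
from pathlib import Path


def stage_directory(src: str, dst: str, overwrite: bool = False) -> Path:
    """Copy a whole directory tree (e.g. a MatrixTable folder) from src to dst.

    Hypothetical helper: both paths are assumed to be reachable as plain
    local paths from the driver.
    """
    src_path, dst_path = Path(src), Path(dst)
    if not src_path.is_dir():
        raise FileNotFoundError(f"{src_path} is not a directory")
    if dst_path.exists():
        if not overwrite:
            raise FileExistsError(f"{dst_path} already exists")
        # Remove the stale copy so copytree starts from a clean target.
        shutil.rmtree(dst_path)
    # Make sure the parent folder of the target exists before copying.
    dst_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src_path, dst_path)
    return dst_path


# Hypothetical usage (placeholder paths, assumed single-node cluster):
#   mt.write("file:///tmp/scratch/cohort.mt")      # Hail writes to scratch
#   stage_directory("/tmp/scratch/cohort.mt",
#                   "/Volumes/my_catalog/my_schema/my_volume/cohort.mt")
# and the reverse copy before calling hl.read_matrix_table on the scratch path.
```

Copying the whole directory in one step keeps the MatrixTable's internal layout intact. Refusing to overwrite by default is a deliberate choice, so an existing dataset on the volume is not replaced by accident.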

I’m excited to see whether Unity Volumes will support Hail read/write directly soon.

I will continue to post my findings to this topic for those interested in using Hail on DataBricks.

Links posted in comments below!

Alexis

Links

Posting these here because Discuss wouldn’t let me put more than two links in the original post (as a new user).

The repo with the DataBricks 13.3 LTS / Hail 0.2.126 (Spark 3.4.1) image that can be attached to a DataBricks Personal Compute cluster - docker-who/repositories/hail/0.2.126–spark-3.4.1-patch at main · umccr/docker-who (github.com)

The Spark 3.4.1 compatibility patch - issue here

DataBricks Unity Volumes - Unity Volumes

The project-glow repository - projectglow