How should I use Hail on the DNAnexus RAP?

How should I use Hail on the DNAnexus RAP? I understand that there are at least two different file systems: the Project File System (the “File API”) and the Database File System (the “Database API”).

importing data

The project file system is used for files not produced by Hail, such as VCF or BGEN files. Always access these files through the DNAnexus Project File System FUSE mount: file:///mnt/project/.... Do not download them onto HDFS.
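For example, importing a VCF through the FUSE mount might look like the sketch below. The file path is a placeholder, not a real project file, and the import options shown are assumptions you should adjust for your data:

```python
# A minimal sketch; the path is hypothetical -- substitute a real file
# from your project's FUSE mount.
vcf_path = "file:///mnt/project/Bulk/my_cohort/chr1.vcf.gz"

# On the cluster, after initializing Hail (see below):
# import hail as hl
# mt = hl.import_vcf(vcf_path, force_bgz=True, reference_genome="GRCh38")
```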

The project file system can handle neither Hail Tables nor Hail Matrix Tables; however, DNAnexus provides the database file system for storing Table and Matrix Table files. A “database” in the database file system is roughly analogous to a “bucket” in Amazon S3 or Google Cloud Storage. You can create a database by executing this Python code:

import pyspark

# The database is created through Spark SQL; the dnax:// location tells
# DNAnexus to back it with the database file system.
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
spark.sql("CREATE DATABASE my_database LOCATION 'dnax://'")

You only need to do this once. Thereafter, every time you initialize Hail you must start a Spark session first and use this database as Hail’s temporary directory:

import pyspark
import dxpy
import hail as hl

# Look up the database created above; its id is what appears in dnax:// paths.
my_database = dxpy.find_one_data_object(
    name="my_database", classname="database"
)["id"]

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

# Hail's temporary directory must live in the database file system.
hl.init(sc=sc, tmp_dir=f'dnax://{my_database}/tmp/')

Moreover, any time you want to write a Table or Matrix Table you need to use a path that starts with f'dnax://{my_database}'.
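As a sketch of such a write, with a placeholder database id and table name (in practice the id comes from the dxpy lookup above):

```python
# Hypothetical database id; in practice it comes from
# dxpy.find_one_data_object(name="my_database")["id"].
my_database = "database-XXXX"
out_path = f"dnax://{my_database}/qc_results.ht"

# With a running Hail session on the cluster:
# ht.write(out_path)
```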

NB: You cannot inspect the contents of a DNAnexus database from the web UI, so be sure to keep track of every file path you’ve used!

exporting data

If you export a merged TSV or VCF and want to move it from the database file system to the project file system, pipe it through dx upload (the trailing - tells dx upload to read from standard input):

hadoop fs -cat dnax://my_database/path | dx upload -o /path/in/project -

If you need to export a results Table or Matrix Table (which are composed of many individual files), configure your DNAnexus Spark cluster to access an S3 bucket you own directly, and then write straight into that bucket:

results_mt.write('s3a://my-s3-bucket/results.mt')
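The S3A connection is configured through standard Hadoop properties. A sketch, assuming static credentials (the key values are placeholders, and your cluster may instead use an instance profile or another credential provider):

```python
# Standard Hadoop S3A property names; the values are placeholders.
s3_conf = {
    "fs.s3a.access.key": "YOUR_ACCESS_KEY",
    "fs.s3a.secret.key": "YOUR_SECRET_KEY",
}

# Applied to the running SparkContext before writing:
# for key, value in s3_conf.items():
#     sc._jsc.hadoopConfiguration().set(key, value)
# results_mt.write("s3a://my-s3-bucket/results.mt")
```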