importing data
The project file system is used for files not produced by Hail, such as VCF or BGEN files. Always access this data through the DNAnexus Project File System FUSE mount: file:///mnt/project/.... Do not download this data onto HDFS.
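Once Hail has been initialized (see below), importing a VCF from the project file system looks something like this (a sketch: the file path and reference genome are placeholders for your own data):
# Read a VCF directly from the FUSE-mounted project file system; nothing is copied onto HDFS.
mt = hl.import_vcf(
    'file:///mnt/project/path/to/my_cohort.vcf.bgz',
    reference_genome='GRCh38',
)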
The project file system cannot store Hail Tables or Hail Matrix Tables; for those, DNAnexus provides the database file system. A “database” in the database file system is roughly analogous to a “bucket” in Amazon S3 or Google Cloud Storage. You can create a “database” by executing this Python code:
import pyspark
import dxpy
# Start Spark and open a Spark SQL session; the database is created through Spark SQL.
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
# Create a database in the DNAnexus database file system (the dnax:// scheme).
spark.sql("CREATE DATABASE my_database LOCATION 'dnax://'")
You only need to do this once. Every time you initialize Hail, you must start a Spark SQL session first and use this database as Hail's temporary directory:
import pyspark
import dxpy
import hail as hl
# Look up the ID of the database created above.
my_database = dxpy.find_one_data_object(
    name="my_database",
    project=dxpy.find_one_project()["id"]
)["id"]
# The Spark SQL session must exist before Hail is initialized.
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
# Point Hail's temporary directory at the database.
hl.init(sc=sc, tmp_dir=f'dnax://{my_database}/tmp/')
Moreover, any time you write a Table or Matrix Table, you must use a path that starts with f'dnax://{my_database}'.
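For example, a write and a subsequent read might look like this (a sketch: mt stands for a Matrix Table you have already created, and the file name is arbitrary):
# Write a Matrix Table into the database file system and read it back.
mt.write(f'dnax://{my_database}/my_dataset.mt')
mt = hl.read_matrix_table(f'dnax://{my_database}/my_dataset.mt')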
NB: You cannot inspect the contents of a DNAnexus database from the web UI, so be certain to remember all the file paths you’ve used!
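If you do lose track of a path, you may be able to list a database's contents from a running notebook using Hail's Hadoop helpers (an assumption: this relies on the dnax Hadoop connector available in the DNAnexus Spark environment):
# List everything stored under the database.
for entry in hl.hadoop_ls(f'dnax://{my_database}/'):
    print(entry['path'])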
exporting data
If you export a merged TSV or VCF and want to move it from the database file system to the project file system, execute this shell command:
hadoop fs -cat dnax://my_database/path | dx upload -o /path/in/project -
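For reference, such merged files can be produced with Hail's exporters before being copied out (a sketch: ht and mt stand for an existing Table and Matrix Table, and the file names are arbitrary):
# Export a Table as a single TSV and a Matrix Table as a single block-gzipped VCF,
# both into the database file system.
ht.export(f'dnax://{my_database}/results.tsv')
hl.export_vcf(mt, f'dnax://{my_database}/results.vcf.bgz')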
If you need to export a results Table or Matrix Table (each of which comprises many individual files), configure your DNAnexus Spark cluster to access an S3 bucket you own, then write directly into that bucket:
results_mt.write('s3a://my-s3-bucket/results.mt')
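One way to grant the cluster access is to set the standard Hadoop S3A credential properties on the running Spark context before writing (a sketch: the key values are placeholders, and your cluster may supply credentials through another mechanism):
# Set S3A credentials on the running Spark context (placeholder values shown).
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY_ID")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_ACCESS_KEY")
results_mt.write('s3a://my-s3-bucket/results.mt')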