MatrixTable directory not appearing in the working directory

I am using AnVIL’s Terra platform to run a Jupyter notebook, and I’m using Hail in that notebook for some genetic analyses. The problem I am running into is that I cannot find the MatrixTable directory on the notebook’s persistent disk when I run an "!ls -a" command inside the notebook. The code I am using is below:

Import and initialize Hail

import hail as hl
hl.init()
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()

Print contents of current directory

!ls -a

Import a gzipped VCF and write it as a MatrixTable

direct_path_to_gz = ""  # path left out for privacy reasons
hl.import_vcf(direct_path_to_gz,
              reference_genome='GRCh37',
              n_partitions=512,
              force_bgz=True).write("name_of_matrix_table", overwrite=True)  # name left out for privacy reasons

Re-print contents of current directory

!ls -a

When I run the second "!ls -a" command, I see the Hail log file but no MatrixTable directory. The weird thing is that I can run the read-matrix-table function with the same name I gave the write function, and that works fine. How is Hail finding a directory that does not exist? I need the directory so I can "scp" it from the notebook’s persistent disk storage into the workspace bucket storage.
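
For reference, the read that works looks something like this (the name is elided, as above):

mt = hl.read_matrix_table("name_of_matrix_table")  # same name given to write
mt.count()  # succeeds even though ls shows no such directory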

Thanks for your help!

Hi @hpatel96! Sorry to hear you’re having trouble. I’m sure we can fix this for you.

Hail in Terra uses an Apache Spark cluster, and Spark clusters use Hadoop filesystems. When you write to, say, /foo/bar/baz, you’re writing into Hadoop, not the local filesystem of the driver / notebook node. In general, your notebook isn’t necessarily on the same node as the driver anyway, and the worker nodes don’t share the driver’s filesystem, so they couldn’t all write to it in any case.
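
You can see this from inside the notebook with Hail’s Hadoop utilities (a sketch, reusing the placeholder name from above):

import hail as hl
import os

# The write landed in the cluster's Hadoop filesystem, so Hail's Hadoop
# helpers can see it even though the local shell cannot.
print(hl.hadoop_exists("name_of_matrix_table"))  # True: found in the Hadoop filesystem
print(os.path.exists("name_of_matrix_table"))    # False: not on the driver's local disk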

I strongly recommend against using Hadoop / HDFS for anything. Use Google Cloud Storage pervasively, for all your storage needs. When you initialize Hail, set hl.init(tmp_dir='gs://your_bucket/tmp/'). When you write, write to "gs://your_bucket/project/data.mt".
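
Concretely, the pipeline above becomes something like this (a sketch; your_bucket and the file paths are placeholders):

import hail as hl

# Keep Hail's temporary files in the bucket as well, not on HDFS.
hl.init(tmp_dir='gs://your_bucket/tmp/')

# Import the VCF and write the MatrixTable straight to Google Cloud Storage.
mt = hl.import_vcf('gs://your_bucket/path/to/data.vcf.gz',
                   reference_genome='GRCh37',
                   n_partitions=512,
                   force_bgz=True)
mt.write('gs://your_bucket/project/data.mt', overwrite=True)

# Reads use the same gs:// path, so there is nothing to scp off the persistent disk.
mt = hl.read_matrix_table('gs://your_bucket/project/data.mt')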

Thanks so much! This worked and I was able to get it up and running!
