MatrixTable directory not appearing in the working directory

I am using AnVIL’s Terra platform to run a Jupyter notebook, and I’m using Hail in that notebook for some genetic analyses. The problem I am running into is that I cannot find the MatrixTable directory on the notebook’s persistent disk when I run an "!ls -a" command inside the notebook. The code I am using is below:

Import and initialize Hail

import hail as hl
hl.init()
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()

Print contents of current directory

!ls -a

Import a gzipped VCF and write it as a MatrixTable

direct_path_to_gz = ""  # path left out for privacy reasons
hl.import_vcf(direct_path_to_gz,
              reference_genome='GRCh37',
              n_partitions=512,
              force_bgz=True).write("name_of_matrix_table", overwrite=True)  # name left out for privacy reasons

Re-print contents of current directory

!ls -a

When I run the second "!ls -a" command, I see the Hail log file but no MatrixTable directory. The weird thing is that I can run the read-matrix-table function with the same name I gave the write function, and that works fine. How is Hail finding a directory that does not exist? I need the directory so I can "scp" it from the notebook’s persistent disk storage into the workspace bucket storage.
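
For reference, the read that works looks something like this (the name is elided, as above):

mt = hl.read_matrix_table("name_of_matrix_table")  # same name given to write
mt.count()  # succeeds even though ls shows no such directory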

Thanks for your help!

Hi @hpatel96! Sorry to hear you’re having trouble. I’m sure we can fix this for you.

Hail in Terra uses an Apache Spark cluster, and Spark clusters use Hadoop filesystems. When you write to, say, /foo/bar/baz, you’re writing into Hadoop, not the local filesystem of the driver / notebook node. In general, your notebook isn’t necessarily on the same node as the driver anyway, and the worker nodes don’t share the driver’s filesystem, so they couldn’t all write to it in any case.
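
You can see this from inside the notebook with Hail’s Hadoop utilities (a sketch, reusing the placeholder name from above):

import hail as hl
import os

# The write landed in the cluster's Hadoop filesystem, so Hail's Hadoop
# helpers can see it even though the local shell cannot.
print(hl.hadoop_exists("name_of_matrix_table"))  # True: found in the Hadoop filesystem
print(os.path.exists("name_of_matrix_table"))    # False: not on the driver's local disk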

I strongly recommend against using Hadoop / HDFS for anything. Use Google Cloud Storage pervasively, for all your storage needs. When you initialize Hail, set hl.init(tmp_dir='gs://your_bucket/tmp/'). When you write, write to "gs://your_bucket/project/data.mt".
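
Concretely, the pipeline above becomes something like this (a sketch; your_bucket and the file paths are placeholders):

import hail as hl

# Keep Hail's temporary files in the bucket as well, not on HDFS.
hl.init(tmp_dir='gs://your_bucket/tmp/')

# Import the VCF and write the MatrixTable straight to Google Cloud Storage.
mt = hl.import_vcf('gs://your_bucket/path/to/data.vcf.gz',
                   reference_genome='GRCh37',
                   n_partitions=512,
                   force_bgz=True)
mt.write('gs://your_bucket/project/data.mt', overwrite=True)

# Reads use the same gs:// path, so there is nothing to scp off the persistent disk.
mt = hl.read_matrix_table('gs://your_bucket/project/data.mt')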

Thanks so much! This worked and I was able to get it up and running!
