Cannot load a Hail Table in Terra notebook

I am using a jupyter notebook in Terra.

version 0.2.62

I am in a directory local to the Hail Table directory.

trainings = hl.read_table("")

gives an error:

HailException: MatrixTable and Table files are directories; path '' is not a directory

If I use the full path, then trainings = hl.read_table("file:///home/jupyter/notebooks/AoU_DRC_WGS_LTL_analyses/edit/") appears to work:


Global fields:
Row fields:
    's': str 
    'scores': array<float64> 
    'population_inference.pop': str 
    'pop_label': str 
Key: ['s']

BUT if I actually try to use it in a calculation:

tmp = trainings.s.collect()

FileNotFoundException: File file:/home/jupyter/notebooks/AoU_DRC_WGS_LTL_analyses/edit/ does not exist

But this is clearly erroneous, since:
! ls -al /home/jupyter/notebooks/AoU_DRC_WGS_LTL_analyses/edit/

-rw-r--r-- 1 jupyter users 26311 Jul 13 21:07 /home/jupyter/notebooks/AoU_DRC_WGS_LTL_analyses/edit/

How should I be specifying local file paths? I believe I am doing the same thing as the reference material. Is the Jupyter notebook introducing an unforeseen complication?

The core problem is that you’re trying to read a file that lives on a local file system (visible only to the cluster driver node) from a cluster of multiple machines.

This is a bad error message (it should probably say “file doesn’t exist”). It happens because if you don’t supply a file scheme (scheme://), the default filesystem is used, which is HDFS on Dataproc clusters. There’s no directory with that name in the default user directory in HDFS.
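To make the scheme behavior concrete, here is a small sketch (using only the standard library, not Hail itself) of how a bare path differs from one with an explicit scheme. The paths are illustrative, not taken from your notebook:

```python
# A bare path carries no scheme, so Hadoop-based systems fall back to the
# cluster's default filesystem (HDFS on Dataproc). file:// and gs:// name
# a filesystem explicitly.
from urllib.parse import urlparse

for path in ["/home/jupyter/my_table.ht",
             "file:///home/jupyter/my_table.ht",
             "gs://my-bucket/my_table.ht"]:
    scheme = urlparse(path).scheme or "<none: default filesystem used>"
    print(f"{path!r} -> scheme {scheme}")
```

This is why `read_table("")` with a bare or empty path ends up looking in HDFS rather than on the driver’s local disk.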

You can “read” this file because the read_table operation only triggers a metadata read on the driver node, to get types and such; it doesn’t load any of the data. When you try to compute on the data, such as collecting it, a worker node is enlisted to read and compute on the data, and there’s no file:///home/jupyter/... path on the worker nodes’ local file systems.

The answer is that you can’t use a cluster to compute on a Hail Table that’s stored on the driver node’s local file system. Copy it to HDFS or Google Storage first.
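A minimal sketch of that fix, assuming a gsutil copy to a workspace bucket (the bucket name here is a placeholder, and the actual copy and Hail calls are commented out so the snippet stands alone):

```python
import subprocess

LOCAL_TABLE = "/home/jupyter/notebooks/AoU_DRC_WGS_LTL_analyses/edit/"
GCS_TABLE = "gs://my-workspace-bucket/edit/"  # hypothetical destination bucket

# Hail Tables are directories, so the copy must be recursive (-r);
# -m parallelizes the transfer.
cmd = ["gsutil", "-m", "cp", "-r", LOCAL_TABLE, GCS_TABLE]
# subprocess.run(cmd, check=True)  # uncomment to actually perform the copy

# Afterwards, every node in the cluster can reach the table:
# import hail as hl
# trainings = hl.read_table(GCS_TABLE)
# tmp = trainings.s.collect()
```

Once the table lives in Google Storage (or HDFS), the workers can read it with the explicit gs:// scheme and collect() works as expected.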

Okay, that makes sense. It also explains why it “sometimes seemed to work”: I was probably running on a single machine in those cases.