Cannot load a Hail Table in Terra notebook

I am using a jupyter notebook in Terra.

version 0.2.62

I am in a directory local to the Hail Table directory.

trainings = hl.read_table("")

gives an error:

HailException: MatrixTable and Table files are directories; path '' is not a directory

If I use the full path, then trainings = hl.read_table("file:///home/jupyter/notebooks/AoU_DRC_WGS_LTL_analyses/edit/") appears to work:


Global fields:
Row fields:
    's': str 
    'scores': array<float64> 
    'population_inference.pop': str 
    'pop_label': str 
Key: ['s']

BUT if I actually try to use it in a calculation:

tmp = trainings.s.collect()

FileNotFoundException: File file:/home/jupyter/notebooks/AoU_DRC_WGS_LTL_analyses/edit/ does not exist

But this is clearly erroneous, since:
! ls -al /home/jupyter/notebooks/AoU_DRC_WGS_LTL_analyses/edit/

-rw-r--r-- 1 jupyter users 26311 Jul 13 21:07 /home/jupyter/notebooks/AoU_DRC_WGS_LTL_analyses/edit/

How should I be specifying local file paths? I believe I am doing the same thing as the reference material. Is the Jupyter notebook introducing an unforeseen complication?

The core problem is that you’re trying to read a file that lives on a local file system (visible only to the cluster driver node) from a cluster of multiple machines.

This is a bad error message (it should probably say “file doesn’t exist”). It happens because if you don’t supply a file scheme (scheme://), the default filesystem is used, which is HDFS on Dataproc clusters. There’s no directory with that name in the default user directory in HDFS.
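To make the scheme behavior concrete, here is a small sketch (using only the standard library, not Hail itself) of how a bare path differs from one with an explicit scheme. The paths are illustrative, not taken from your notebook:

```python
# A bare path carries no scheme, so Hadoop-based systems fall back to the
# cluster's default filesystem (HDFS on Dataproc). file:// and gs:// name
# a filesystem explicitly.
from urllib.parse import urlparse

for path in ["/home/jupyter/my_table.ht",
             "file:///home/jupyter/my_table.ht",
             "gs://my-bucket/my_table.ht"]:
    scheme = urlparse(path).scheme or "<none: default filesystem used>"
    print(f"{path!r} -> scheme {scheme}")
```

This is why `read_table("")` with a bare or empty path ends up looking in HDFS rather than on the driver’s local disk.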

You can “read” this file because the read_table operation only triggers a metadata read on the driver node, to get types and such; it doesn’t load any of the data. When you try to compute on the data, such as collecting it, a worker node is enlisted to read and compute on the data, and there’s no file:///home/jupyter/... path on the worker nodes’ local file systems.

The answer is that you can’t use a cluster to compute on a Hail Table that’s stored on the driver node’s local file system. Copy it to HDFS or Google Storage first.
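A minimal sketch of that fix, assuming a gsutil copy to a workspace bucket (the bucket name here is a placeholder, and the actual copy and Hail calls are commented out so the snippet stands alone):

```python
import subprocess

LOCAL_TABLE = "/home/jupyter/notebooks/AoU_DRC_WGS_LTL_analyses/edit/"
GCS_TABLE = "gs://my-workspace-bucket/edit/"  # hypothetical destination bucket

# Hail Tables are directories, so the copy must be recursive (-r);
# -m parallelizes the transfer.
cmd = ["gsutil", "-m", "cp", "-r", LOCAL_TABLE, GCS_TABLE]
# subprocess.run(cmd, check=True)  # uncomment to actually perform the copy

# Afterwards, every node in the cluster can reach the table:
# import hail as hl
# trainings = hl.read_table(GCS_TABLE)
# tmp = trainings.s.collect()
```

Once the table lives in Google Storage (or HDFS), the workers can read it with the explicit gs:// scheme and collect() works as expected.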

Okay, that makes sense. It also explains why it “sometimes seemed to work”: I was probably running on a single machine in those cases.