The core problem is that you’re trying to load a file that lives on a local file system (visible only to the cluster’s driver node) from a cluster of multiple machines.
The error message is bad (it should probably say “file doesn’t exist”), but it’s happening because if you don’t supply a file scheme (scheme://), the default is used, which is HDFS on Dataproc clusters. There’s no folder named training_pca.ht in the default user directory in HDFS.
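As a rough illustration (the HDFS expansion below is approximate, not something Hail prints):

```python
import hail as hl

# With no scheme, Dataproc resolves this against HDFS, roughly
# hdfs:///user/<you>/training_pca.ht -- which doesn't exist, hence the error.
ht = hl.read_table('training_pca.ht')
```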
You can read this file because the read_table operation only triggers a metadata read on the driver node (to get the types and so on); it doesn’t load any of the data. When you try to compute on the data, e.g. to collect it, a worker node is enlisted to read and process the data, and there’s no file:///home/jupyter/... path on the worker nodes’ local file systems.
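Concretely, the failure looks something like this (a sketch using the path from your post):

```python
import hail as hl

ht = hl.read_table('file:///home/jupyter/training_pca.ht')
ht.describe()   # works: only the driver reads the table's metadata

ht.collect()    # fails: workers try to open /home/jupyter/... on their own disks
```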
The answer is that you can’t use a cluster to compute on a Hail table that’s stored on the driver node’s local file system. Instead, copy it to HDFS or Google Storage first.
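For example, something like this should work (the bucket name is a placeholder; if hl.hadoop_copy doesn’t copy the .ht directory recursively in your Hail version, `gsutil cp -r` from a terminal on the driver does the same job):

```python
import hail as hl

# Copy the table directory from the driver's local disk to Google Storage.
hl.hadoop_copy('file:///home/jupyter/training_pca.ht',
               'gs://your-bucket/training_pca.ht')

# Now every worker can read it, because gs:// paths are visible cluster-wide.
ht = hl.read_table('gs://your-bucket/training_pca.ht')
ht.show()
```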