Cannot load a Hail Table in Terra notebook

The core problem is that you're trying to read a file that lives on a local file system (visible only to the cluster's driver node) from a cluster of multiple machines.

This is a bad error message (it should probably say "file doesn't exist"). It happens because, if you don't supply a file scheme (scheme://), the default file system is used, which on Dataproc clusters is HDFS, and there's no directory named training_pca.ht in the default HDFS user directory.
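To illustrate, here is how the same table name resolves to different file systems depending on the scheme. This is a sketch: the local path is taken from the report above, and `gs://my-bucket` is a hypothetical bucket name.

```python
import hail as hl

# No scheme: the path resolves against the cluster's default file
# system, which is HDFS on Dataproc. This is what fails here, because
# there is no training_pca.ht under the default HDFS user directory.
ht = hl.read_table('training_pca.ht')

# Explicit schemes make the target file system unambiguous:
# ht = hl.read_table('file:///home/jupyter/training_pca.ht')  # driver's local disk
# ht = hl.read_table('gs://my-bucket/training_pca.ht')        # Google Cloud Storage
```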

You can read this file because the read_table operation only triggers a metadata read on the driver node (to fetch the schema and so on); it doesn't load any of the data. When you compute on the data, for example by collecting it, worker nodes are enlisted to read the data, and there is no file:///home/jupyter/... path on the workers' local file systems.
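A minimal sketch of the failure mode (the exact path is assumed from the error above):

```python
import hail as hl

# Succeeds: read_table only reads metadata, on the driver, where the
# file actually exists.
ht = hl.read_table('file:///home/jupyter/training_pca.ht')
ht.describe()  # also fine: the schema comes from the metadata

# Fails: collect() launches a distributed job, and the workers try to
# open file:///home/jupyter/... on their own local disks, where no
# such path exists.
ht.collect()
```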

The answer is that you can't use a cluster to compute on a Hail table stored on the driver node's local file system. Copy it to HDFS or Google Cloud Storage first.
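One way to do that from a Terra notebook cell, assuming gsutil is available on the driver (it is on the standard Terra images) and using a hypothetical bucket `gs://my-bucket`. Note that a .ht table is a directory, so the copy must be recursive:

```python
# Copy the table directory from the driver's local disk to GCS
# (-r: recursive, -m: parallel copy).
!gsutil -m cp -r /home/jupyter/training_pca.ht gs://my-bucket/

import hail as hl

# Every node can read from GCS, so distributed computation now works.
ht = hl.read_table('gs://my-bucket/training_pca.ht')
ht.collect()
```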