The core problem is that you’re trying to load a file that lives on a local file system (visible only to the cluster’s driver node) from a cluster of multiple machines.
The error message is bad (it should probably say “file doesn’t exist”), but it’s happening because if you don’t supply a file scheme (scheme://), the default is used, which is HDFS on Dataproc clusters. There’s no folder named training_pca.ht in the default user directory in HDFS.
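As a rough illustration (the HDFS expansion below is approximate, not something Hail prints):

```python
import hail as hl

# With no scheme, Dataproc resolves this against HDFS, roughly
# hdfs:///user/<you>/training_pca.ht -- which doesn't exist, hence the error.
ht = hl.read_table('training_pca.ht')
```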
You can read this file because the read_table operation only triggers a metadata read on the driver node (to get the types and so on); it doesn’t load any of the data. When you try to compute on the data, e.g. to collect it, a worker node is enlisted to read and process the data, and there’s no file:///home/jupyter/... path on the worker nodes’ local file systems.
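Concretely, the failure looks something like this (a sketch using the path from your post):

```python
import hail as hl

ht = hl.read_table('file:///home/jupyter/training_pca.ht')
ht.describe()   # works: only the driver reads the table's metadata

ht.collect()    # fails: workers try to open /home/jupyter/... on their own disks
```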
The answer is that you can’t use a cluster to compute on a Hail table that’s stored on the driver node’s local file system. Instead, copy it to HDFS or Google Storage first.
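For example, something like this should work (the bucket name is a placeholder; if hl.hadoop_copy doesn’t copy the .ht directory recursively in your Hail version, `gsutil cp -r` from a terminal on the driver does the same job):

```python
import hail as hl

# Copy the table directory from the driver's local disk to Google Storage.
hl.hadoop_copy('file:///home/jupyter/training_pca.ht',
               'gs://your-bucket/training_pca.ht')

# Now every worker can read it, because gs:// paths are visible cluster-wide.
ht = hl.read_table('gs://your-bucket/training_pca.ht')
ht.show()
```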