FileNotFoundException when reading VDS

Continuing my adventures from earlier, I am now successfully submitting Spark jobs from my Docker container to Dataproc. Now I'm attempting to run Hail commands in PySpark. I'm not sure if the problem I'm seeing is a result of my unusual setup.

I’m trying to load a local (OK, Docker-mounted) VDS file that I copied from a previous successful Hail run, but I’m getting a FileNotFoundException.

>>> from hail import *
>>> hc = HailContext(sc)
hail: info: SparkUI: http://10.128.0.6:4040
>>> count = hc.read('file:/home/joel/work/1kg.vds').count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<decorator-gen-153>", line 2, in read
  File "/hail/python/hail/java.py", line 104, in handle_py4j
    raise FatalError(msg)
hail.java.FatalError: FileNotFoundException: File file:/home/joel/work/1kg.vds/rdd.parquet/part-r-00000-4b3d3bac-2697-423b-8283-07af77b46a72.snappy.parquet does not exist

However, I do see the file /home/joel/work/1kg.vds/rdd.parquet/part-r-00000-4b3d3bac-2697-423b-8283-07af77b46a72.snappy.parquet on disk.

Do you know why it might not see this file? Thanks.

Also, do I have the syntax right for local files? I also tried hc.read('/home/joel/work/1kg.vds').count(), but that gave a different error instead: hail.java.FatalError: arguments refer to no files

It looks like you may have an old version of Hail. Newer versions include a full Java stack trace, which would help with debugging in the future, so please upgrade if you can. In general, the latest versions of the Hail jar and the Python files are available here:

dking@wmb16-359 # gsutil ls "gs://hail-common/hail-hail-is-master-all-spark2.0.2-$(gsutil cat gs://hail-common/latest-hash.txt).jar"
gs://hail-common/hail-hail-is-master-all-spark2.0.2-53e9d33.jar
dking@wmb16-359 # gsutil ls "gs://hail-common/pyhail-hail-is-master-$(gsutil cat gs://hail-common/latest-hash.txt).zip" 
gs://hail-common/pyhail-hail-is-master-53e9d33.zip
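
If it helps, here is a rough sketch of one way to wire those two artifacts into a PySpark session after copying them somewhere local. The /path/to/... locations are placeholders, the exact Spark settings can vary by cluster, and this assumes you start the SparkContext yourself rather than relying on the pyspark shell:

from pyspark import SparkConf, SparkContext

# Placeholder paths -- substitute wherever you downloaded the artifacts.
hail_jar = '/path/to/hail-hail-is-master-all-spark2.0.2-53e9d33.jar'
pyhail_zip = '/path/to/pyhail-hail-is-master-53e9d33.zip'

conf = (SparkConf()
        .set('spark.jars', hail_jar)                    # ship the jar to driver and executors
        .set('spark.driver.extraClassPath', hail_jar))  # make sure the driver JVM sees it
sc = SparkContext(conf=conf)
sc.addPyFile(pyhail_zip)  # makes the hail Python package importable on driver and workers

from hail import *
hc = HailContext(sc)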

You correctly deduced that the file: prefix is required; the default file system is whatever Hadoop file system the cluster is connected to.
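
To make the path handling concrete, here is a quick sketch of the three forms; the HDFS path and bucket name are just placeholders:

# Explicit local scheme: each Spark executor resolves this against its *own* local disk,
# so the file must exist at that path on every worker node.
hc.read('file:/home/joel/work/1kg.vds')

# No scheme: resolved against the cluster's default Hadoop file system
# (typically HDFS on a Dataproc cluster), which is why the bare path
# reported "arguments refer to no files".
hc.read('/user/joel/1kg.vds')       # placeholder HDFS path

# Google Cloud Storage: visible to every node through the GCS connector.
hc.read('gs://my-bucket/1kg.vds')   # placeholder bucket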

My best guess is that the user executing the Spark job does not have the same permissions as your shell session. @tpoterba is out of town this week, but he might have a better idea of what is wrong when he returns. It’s also possible this is a now-fixed bug in the version of Hail you’re using.

@Joel_Thibault I’ve given some more thought to this issue, and I’m somewhat surprised that you’re reading from the local file system. I suspect this error comes from an executor on a worker node that doesn’t share a local file system with the master node where, I assume, you’ve placed 1kg.vds. Usually a Parquet-backed file like a VDS is stored on a network file system such as HDFS so that the cluster’s executors can ingest it in parallel. I’m not sure whether Dataproc has a traditional HDFS; Google pushes Google Cloud Storage pretty aggressively. Is it possible for you to store 1kg.vds in Google Cloud Storage instead?
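
The change on the Hail side would be small. A sketch, with a placeholder bucket name:

# One-time copy from the machine that has the VDS (placeholder bucket):
#   gsutil -m cp -r /home/joel/work/1kg.vds gs://my-bucket/
# Then every executor can read it through the GCS connector:
count = hc.read('gs://my-bucket/1kg.vds').count()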

OK, our workflow will be using GCS anyway, so I’ll stop trying to make local files work where they’re not appropriate. I’ll upgrade Hail too so I don’t waste your time with bug reports for old versions!

Thanks.

Working much better now with gs:// files. Thanks.