FileNotFoundException when reading VDS

Continuing my adventures from earlier, I am now successfully submitting Spark jobs from my Docker container to Dataproc. Now I'm attempting to run Hail commands in PySpark. I'm not sure if the problem I'm seeing is a result of my unusual setup.

I’m trying to load a local (OK, Docker-mounted) VDS file that I copied from a previous successful Hail run, but I’m getting a FileNotFoundException.

>>> from hail import *
>>> hc = HailContext(sc)
hail: info: SparkUI: http://10.128.0.6:4040
>>> count = hc.read('file:/home/joel/work/1kg.vds').count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<decorator-gen-153>", line 2, in read
  File "/hail/python/hail/java.py", line 104, in handle_py4j
    raise FatalError(msg)
hail.java.FatalError: FileNotFoundException: File file:/home/joel/work/1kg.vds/rdd.parquet/part-r-00000-4b3d3bac-2697-423b-8283-07af77b46a72.snappy.parquet does not exist

However, I do see the file /home/joel/work/1kg.vds/rdd.parquet/part-r-00000-4b3d3bac-2697-423b-8283-07af77b46a72.snappy.parquet on disk.

Do you know why it might not see this file? Thanks.

Also, do I have the syntax right for local files? I also tried hc.read('/home/joel/work/1kg.vds').count(), but that gave a different error instead: hail.java.FatalError: arguments refer to no files

It looks like you may have an old version of Hail. Newer versions include a full Java stack trace, which would help with debugging in the future, so please upgrade if you can. In general, the latest versions of the Hail jar and the Python files are available here:

dking@wmb16-359 # gsutil ls "gs://hail-common/hail-hail-is-master-all-spark2.0.2-$(gsutil cat gs://hail-common/latest-hash.txt).jar"
gs://hail-common/hail-hail-is-master-all-spark2.0.2-53e9d33.jar
dking@wmb16-359 # gsutil ls "gs://hail-common/pyhail-hail-is-master-$(gsutil cat gs://hail-common/latest-hash.txt).zip" 
gs://hail-common/pyhail-hail-is-master-53e9d33.zip
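
If it helps, here is a rough sketch of one way to wire those two artifacts into a PySpark session after copying them somewhere local. The /path/to/... locations are placeholders, the exact Spark settings can vary by cluster, and this assumes you start the SparkContext yourself rather than relying on the pyspark shell:

from pyspark import SparkConf, SparkContext

# Placeholder paths -- substitute wherever you downloaded the artifacts.
hail_jar = '/path/to/hail-hail-is-master-all-spark2.0.2-53e9d33.jar'
pyhail_zip = '/path/to/pyhail-hail-is-master-53e9d33.zip'

conf = (SparkConf()
        .set('spark.jars', hail_jar)                    # ship the jar to driver and executors
        .set('spark.driver.extraClassPath', hail_jar))  # make sure the driver JVM sees it
sc = SparkContext(conf=conf)
sc.addPyFile(pyhail_zip)  # makes the hail Python package importable on driver and workers

from hail import *
hc = HailContext(sc)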

You correctly deduced that the file: prefix is required; the default file system is whatever Hadoop file system the cluster is connected to.
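
To make the path handling concrete, here is a quick sketch of the three forms; the HDFS path and bucket name are just placeholders:

# Explicit local scheme: each Spark executor resolves this against its *own* local disk,
# so the file must exist at that path on every worker node.
hc.read('file:/home/joel/work/1kg.vds')

# No scheme: resolved against the cluster's default Hadoop file system
# (typically HDFS on a Dataproc cluster), which is why the bare path
# reported "arguments refer to no files".
hc.read('/user/joel/1kg.vds')       # placeholder HDFS path

# Google Cloud Storage: visible to every node through the GCS connector.
hc.read('gs://my-bucket/1kg.vds')   # placeholder bucket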

My best guess is that the user executing the Spark job does not have the same permissions as your shell session. @tpoterba is out of town this week, but he might have a better idea of what is wrong when he returns. It’s also possible this is a now-fixed bug in the version of Hail you’re using.

@Joel_Thibault I’ve given some more thought to this issue, and I’m somewhat surprised that you’re reading from the local file system. I suspect this error comes from an executor on a worker node that doesn’t share a local file system with the master node where, I assume, you’ve placed 1kg.vds. Usually a Parquet-backed file like a VDS is stored on a network file system such as HDFS so that the cluster’s executors can ingest it in parallel. I’m not sure whether Dataproc has a traditional HDFS; Google pushes Google Cloud Storage pretty aggressively. Is it possible for you to store 1kg.vds in Google Cloud Storage instead?
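
The change on the Hail side would be small. A sketch, with a placeholder bucket name:

# One-time copy from the machine that has the VDS (placeholder bucket):
#   gsutil -m cp -r /home/joel/work/1kg.vds gs://my-bucket/
# Then every executor can read it through the GCS connector:
count = hc.read('gs://my-bucket/1kg.vds').count()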

OK, our workflow will be using GCS anyway, so I’ll stop trying to make local files work where they’re not appropriate. I’ll upgrade Hail too so I don’t waste your time with bug reports for old versions!

Thanks.

Working much better now with gs:// files. Thanks.