No file or directory found at gs://ukbb-exome-public/500k/results/results.mt

Hi, I’m trying to query the Genebass data,

import hail as hl

hl.init(local='local[2]', log=logfile, tmp_dir=tmpdir)  # logfile/tmpdir are defined earlier in the script
genebass = hl.read_matrix_table('gs://ukbb-exome-public/500k/results/results.mt')

and I’m getting the following error:

(hail) [basic-dy-t3axlarge-1 hail]$ python3 spark-query-genes.py
2023-01-23 11:08:55 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2023-01-23 11:08:56 WARN  Hail:43 - This Hail JAR was compiled for Spark 3.1.1, running with Spark 3.1.2.
  Compatibility is not guaranteed.
Running on Apache Spark version 3.1.2
SparkUI available at http://basic-dy-t3axlarge-1.bioinformatics-cro-hpc-slurm.pcluster.:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.77-684f32d73643
LOGGING: writing to ~/tmp/hail//hail-filter.log
Traceback (most recent call last):
  File "spark-query-genes.py", line 12, in <module>
    genebass = hl.read_matrix_table('gs://ukbb-exome-public/500k/results/results.mt')
  File "<decorator-gen-1344>", line 2, in read_matrix_table
  File "/apps/users/user2031/mambaforge/envs/hail/lib/python3.7/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/apps/users/user2031/mambaforge/envs/hail/lib/python3.7/site-packages/hail/methods/impex.py", line 2115, in read_matrix_table
    for rg_config in Env.backend().load_references_from_dataset(path):
  File "/apps/users/user2031/mambaforge/envs/hail/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 326, in load_references_from_dataset
    return json.loads(Env.hail().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
  File "/apps/users/user2031/mambaforge/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/apps/users/user2031/mambaforge/envs/hail/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 32, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
hail.utils.java.FatalError: HailException: No file or directory found at gs://ukbb-exome-public/500k/results/results.mt

Java stack trace:
is.hail.utils.HailException: No file or directory found at gs://ukbb-exome-public/500k/results/results.mt
        at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:11)
        at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:11)
        at is.hail.utils.package$.fatal(package.scala:78)
        at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:32)
        at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:73)
        at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:581)
        at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.77-684f32d73643
Error summary: HailException: No file or directory found at gs://ukbb-exome-public/500k/results/results.mt

However, the file is definitely there:

~ » gcloud storage ls gs://ukbb-exome-public/500k/results/results.mt
gs://ukbb-exome-public/500k/results/results.mt/
gs://ukbb-exome-public/500k/results/results.mt/README.txt
gs://ukbb-exome-public/500k/results/results.mt/_SUCCESS
gs://ukbb-exome-public/500k/results/results.mt/metadata.json.gz
gs://ukbb-exome-public/500k/results/results.mt/cols/
gs://ukbb-exome-public/500k/results/results.mt/entries/
gs://ukbb-exome-public/500k/results/results.mt/globals/
gs://ukbb-exome-public/500k/results/results.mt/index/
gs://ukbb-exome-public/500k/results/results.mt/references/
gs://ukbb-exome-public/500k/results/results.mt/rows/

This used to work, and I can’t see the issue now. Thanks for the help!

Huh. This suggests that whatever file system implementation you’re using doesn’t recognize gs://ukbb-exome-public/500k/results/results.mt/ as existing. The Hadoop file system implementations provided by Google in Dataproc should work properly (we have many tests that use them). You might also try broadinstitute/install-gcs-connector (https://github.com/broadinstitute/install-gcs-connector) to install a GCS Hadoop connector that works as expected.