I am running the following command:
In [4]: hc.read('gs://data/gnomad.exomes.r2.0.1.sites.autosomes.vds').count()
and getting FatalError: IOException: No FileSystem for scheme: gs
Any idea why there is an issue? The error details are below.
thanks, eilalan
In [5]: hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.autosomes.vds').count()
/home/eila/hail/python/hail/java.pyc in handle_py4j(func, *args, **kwargs)
    111 raise FatalError('%s\n\nJava stack trace:\n%s\n'
    112                  'Hail version: %s\n'
--> 113                  'Error summary: %s' % (deepest, full, Env.hc().version, deepest))
    114 except py4j.protocol.Py4JError as e:
    115     if e.args[0].startswith('An error occurred while calling'):
FatalError: IOException: No FileSystem for scheme: gs
Java stack trace:
java.io.IOException: No FileSystem for scheme: gs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at is.hail.utils.richUtils.RichHadoopConfiguration$.fileSystem$extension(RichHadoopConfiguration.scala:17)
at is.hail.utils.richUtils.RichHadoopConfiguration$$anonfun$exists$extension$1.apply(RichHadoopConfiguration.scala:51)
at is.hail.utils.richUtils.RichHadoopConfiguration$$anonfun$exists$extension$1.apply(RichHadoopConfiguration.scala:51)
at scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
at scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)
at scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35)
at is.hail.utils.richUtils.RichHadoopConfiguration$.exists$extension(RichHadoopConfiguration.scala:51)
at is.hail.variant.VariantDataset$.readMetadata(VariantDataset.scala:103)
at is.hail.HailContext.readMetadata(HailContext.scala:394)
at is.hail.HailContext$$anonfun$readAllMetadata$1.apply(HailContext.scala:396)
at is.hail.HailContext$$anonfun$readAllMetadata$1.apply(HailContext.scala:396)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at is.hail.HailContext.readAllMetadata(HailContext.scala:396)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Hail version: devel-a0e653f
Error summary: IOException: No FileSystem for scheme: gs
It looks like you're not running on Google Dataproc. You need to be using Dataproc to have access to files in Google Storage buckets. Here's a helpful guide Laurent put together for this:
Running on the master machine through ssh is just like running on your laptop: it uses Spark local mode, which means you're not only unable to see Google buckets, but you're also not using any of the other machines in the cluster.
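For intuition about the error itself: Hadoop picks a FileSystem implementation by looking at the URI scheme, and in local mode nothing is registered for "gs". A minimal Python sketch of that dispatch (the registry and function here are illustrative, not Hail's or Hadoop's actual internals):

```python
from urllib.parse import urlparse

# Illustrative: in Spark local mode, the Google Cloud Storage connector is
# not on the classpath, so only schemes like "file" and "hdfs" resolve.
registered_schemes = {"file", "hdfs"}

def filesystem_for(path):
    # Hadoop extracts the scheme from the URI and looks up a FileSystem
    # class for it; an unknown scheme produces exactly this error.
    scheme = urlparse(path).scheme or "file"
    if scheme not in registered_schemes:
        raise IOError("No FileSystem for scheme: %s" % scheme)
    return scheme

print(filesystem_for("file:///tmp/data.vds"))
# filesystem_for("gs://bucket/data.vds") would raise, mirroring the error above
```

On a Dataproc cluster the GCS connector is pre-installed, which is why the same read works there.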
Instead, use:
gcloud dataproc jobs submit pyspark
There is an example of this in the post I linked above.
To clarify, you'll need to download the Google Cloud SDK to your computer to make this easier (if you haven't done that already and were starting the cluster from the UI). Then you can write a Python script that uses Hail and submit it to the cluster with gcloud dataproc jobs submit pyspark, using the format in the post above.
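The submission boils down to a single gcloud invocation. A small sketch that assembles the argument list (e.g., for subprocess.run); the jar and zip paths mirror the ones used later in this thread, and the cluster name is a placeholder:

```python
# Build the `gcloud dataproc jobs submit pyspark` invocation as an argv list.
JAR = "hail-hail-is-master-all-spark2.0.2-E4880e9.jar"

def submit_args(script, cluster="cluster-2"):
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "--cluster=" + cluster,
        # Ship the Hail jar and the Python package to the cluster:
        "--files=gs://hail-common/" + JAR,
        "--py-files=gs://hail-common/pyhail-hail-is-master-E4880e9.zip",
        # Both the driver and the executors need the jar on their classpath:
        "--properties=spark.driver.extraClassPath=./" + JAR
        + ",spark.executor.extraClassPath=./" + JAR,
        script,
    ]

print(" ".join(submit_args("hail_py.py")))
```

Double-check the exact jar/zip names under gs://hail-common for your Hail build; GCS object names are case-sensitive.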
Thank you for the clarification. I hope that I am getting closer to having it running…
Looking forward to the moment that I can make it work with all the amazing gnomAD data.
Should I copy the hail-common files to my bucket and point to them? See the error message below.
My steps were the following:
installed gcloud on my mac + connected to the project
created the following script, hail_py.py:
from hail import *
hc = HailContext()
print(hc)
vds = hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.autosomes.vds')
print(vds.count())
print('end')
called gcloud dataproc jobs submit pyspark:
wm8af-056:scripts landkof$ gcloud dataproc jobs submit pyspark --cluster=cluster-2 --files=gs://hail-common/hail-hail-is-master-all-spark2.0.2-E4880e9.jar --py-files=gs://hail-common/pyhail-hail-is-master-E4880e9.zip --properties="spark.driver.extraClassPath=./hail-hail-is-master-all-spark2.0.2-E4880e9.jar,spark.executor.extraClassPath=./hail-hail-is-master-all-spark2.0.2-E4880e9.jar" hail_py.py
Copying file://hail_py.py [Content-Type=text/x-python]…
\ [1 files][ 166.0 B/ 166.0 B]
Operation completed over 1 objects/166.0 B.
Job [36170c29-6b31-4ba7-a90b-c1322853c8d7] submitted.
The returned value and error message relate to the gs://hail-common files:
Waiting for job output…
=========== Cloud Dataproc Agent Error ===========
java.io.FileNotFoundException: File not found : gs://hail-common/pyhail-hail-is-master-E4880e9.zip
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1427)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2034)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2003)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.copyToLocalFile(GoogleHadoopFileSystemBase.java:2006)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1979)
at com.google.cloud.hadoop.services.agent.util.HadoopUtil.download(HadoopUtil.java:71)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.downloadResources(AbstractJobHandler.java:424)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:543)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:532)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
======== End of Cloud Dataproc Agent Error ========
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [36170c29-6b31-4ba7-a90b-c1322853c8d7] entered state [ERROR] while waiting for [DONE].
It is definitely running now. The trial version is very limited on CPU, so execution is slow. I will keep you updated.
Thank you for your help.
I hope to be able to contribute to the project soon.
Thanks again!
eilalan