Reading VDS from google bucket fires error


#1

Hello,

I am running the following command:
In [4]: hc.read('gs://data/gnomad.exomes.r2.0.1.sites.autosomes.vds').count()
and getting FatalError: IOException: No FileSystem for scheme: gs
Any idea why there is an issue? The error details are below.
Thanks, eilalan


In [5]: hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.autosomes.vds').count()

FatalError Traceback (most recent call last)
in ()
----> 1 hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.autosomes.vds').count()

in read(self, path, sites_only, samples_only)

/home/eila/hail/python/hail/java.pyc in handle_py4j(func, *args, **kwargs)
111 raise FatalError('%s\n\nJava stack trace:\n%s\n'
112 'Hail version: %s\n'
--> 113 'Error summary: %s' % (deepest, full, Env.hc().version, deepest))
114 except py4j.protocol.Py4JError as e:
115 if e.args[0].startswith('An error occurred while calling'):

FatalError: IOException: No FileSystem for scheme: gs

Java stack trace:
java.io.IOException: No FileSystem for scheme: gs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at is.hail.utils.richUtils.RichHadoopConfiguration$.fileSystem$extension(RichHadoopConfiguration.scala:17)
at is.hail.utils.richUtils.RichHadoopConfiguration$$anonfun$exists$extension$1.apply(RichHadoopConfiguration.scala:51)
at is.hail.utils.richUtils.RichHadoopConfiguration$$anonfun$exists$extension$1.apply(RichHadoopConfiguration.scala:51)
at scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
at scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)
at scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35)
at is.hail.utils.richUtils.RichHadoopConfiguration$.exists$extension(RichHadoopConfiguration.scala:51)
at is.hail.variant.VariantDataset$.readMetadata(VariantDataset.scala:103)
at is.hail.HailContext.readMetadata(HailContext.scala:394)
at is.hail.HailContext$$anonfun$readAllMetadata$1.apply(HailContext.scala:396)
at is.hail.HailContext$$anonfun$readAllMetadata$1.apply(HailContext.scala:396)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at is.hail.HailContext.readAllMetadata(HailContext.scala:396)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Hail version: devel-a0e653f
Error summary: IOException: No FileSystem for scheme: gs
In [6]:


#2

It looks like you’re not running on Google Dataproc. You need to be using dataproc to have access to files in Google buckets. Here’s a helpful guide Laurent put together for this:


#3

My guess is either you’re running locally (a laptop, for example) or running directly on a cloud VM without using the dataproc cluster.


#4

I am running on the master machine (ssh): eila@cluster-2-m:~/hail$
The data is a copy of gnomAD.
What am I missing? …


#5

Running on the master machine through ssh is just like running on your laptop – it’s using the Spark local mode, which means you’re not only unable to see google buckets, but you’re also not using any of the other machines in the cluster.

Instead, use:

gcloud dataproc jobs submit pyspark

There is an example of this in the post I linked above.


#6

To clarify, you’ll need to download the google cloud components to your computer to make this easier (if you haven’t done that and were starting the cluster from the UI). Then you can make a python script that uses Hail, and submit that to the cluster with gcloud dataproc jobs submit pyspark using the format in the post above.
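
For anyone who prefers to script the submission, here is a minimal Python sketch that only assembles the `gcloud dataproc jobs submit pyspark` command line; it does not run it. The function name `build_submit_command`, the cluster name, and the bucket paths are hypothetical placeholders, and the flag layout (`--cluster`, `--files`, `--py-files`, `--properties`) is an assumption about how the Hail jar and Python zip are usually passed, so check it against the post linked above.

```python
import shlex


def build_submit_command(cluster, script, hail_jar_uri, hail_zip_uri):
    """Return the argument list for `gcloud dataproc jobs submit pyspark`.

    All four arguments are supplied by the caller; nothing here is
    looked up or validated against a real bucket or cluster.
    """
    # The jar is passed via --files, and the classpath entries refer to it
    # by its bare file name with a relative "./" prefix.
    jar_name = hail_jar_uri.rsplit("/", 1)[-1]
    properties = ",".join([
        "spark.driver.extraClassPath=./" + jar_name,
        "spark.executor.extraClassPath=./" + jar_name,
    ])
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "--cluster=" + cluster,
        "--files=" + hail_jar_uri,
        "--py-files=" + hail_zip_uri,
        "--properties=" + properties,
        script,
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cmd = build_submit_command(
        cluster="my-cluster",
        script="my_hail_script.py",
        hail_jar_uri="gs://my-bucket/hail-all-spark.jar",
        hail_zip_uri="gs://my-bucket/pyhail.zip",
    )
    # Print a shell-safe version to copy into a terminal.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list could also be handed to `subprocess.run` once the names are replaced with real ones; printing it first makes it easy to inspect before anything is submitted.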


#7

Thank you for the clarification. I hope that I am getting closer to having it running.
I am looking forward to the moment I can make it work with all the amazing gnomAD data.
Should I copy the hail-common files to my bucket and point to them? See the error message below.

My steps were as follows:

  1. Installed gcloud on my Mac and connected it to the project.

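  2. Created the following script, hail_py.py:
    from hail import *
    # The print calls are progress markers for the job output.
    print('hc')
    hc = HailContext()
    print('vds')
    # count() returns the counts for the dataset, not a VDS object.
    vds = hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.autosomes.vds').count()
    print('end')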

  3. Called gcloud dataproc jobs submit pyspark:
    wm8af-056:scripts landkof$ gcloud dataproc jobs submit pyspark --cluster=cluster-2 --files=gs://hail-common/hail-hail-is-master-all-spark2.0.2-E4880e9.jar --py-files=gs://hail-common/pyhail-hail-is-master-E4880e9.zip --properties="spark.driver.extraClassPath=./hail-hail-is-master-all-spark2.0.2-E4880e9.jar,spark.executor.extraClassPath=./hail-hail-is-master-all-spark2.0.2-E4880e9.jar" hail_py.py
    Copying file://hail_py.py [Content-Type=text/x-python]…
    \ [1 files][ 166.0 B/ 166.0 B]
    Operation completed over 1 objects/166.0 B.
    Job [36170c29-6b31-4ba7-a90b-c1322853c8d7] submitted.

  4. Result:
    The error message is related to the gs://hail-common files:

Waiting for job output…
=========== Cloud Dataproc Agent Error ===========
java.io.FileNotFoundException: File not found : gs://hail-common/pyhail-hail-is-master-E4880e9.zip
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1427)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2034)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2003)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.copyToLocalFile(GoogleHadoopFileSystemBase.java:2006)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1979)
at com.google.cloud.hadoop.services.agent.util.HadoopUtil.download(HadoopUtil.java:71)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.downloadResources(AbstractJobHandler.java:424)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:543)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:532)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
======== End of Cloud Dataproc Agent Error ========
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [36170c29-6b31-4ba7-a90b-c1322853c8d7] entered state [ERROR] while waiting for [DONE].



#8

Solved this by using the latest ID that was published, which I found with:
gsutil ls -l gs://hail-common/pyhail-hail-is-master-*.zip
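
Picking the latest build out of that listing can also be automated. The sketch below parses text in the `gsutil ls -l` layout (size, timestamp, URI per object, followed by a TOTAL line) and returns the most recently modified object. The function name `latest_object` and the sample listing are made up for illustration; the bucket and file names in it are not real builds, and URIs containing spaces are not handled.

```python
def latest_object(listing):
    """Return the URI of the most recently modified object in `gsutil ls -l` output."""
    entries = []
    for line in listing.splitlines():
        parts = line.split()
        # Object rows have three fields: <size> <timestamp> <gs://uri>.
        # Blank lines and the trailing TOTAL row do not match and are skipped.
        if len(parts) == 3 and parts[2].startswith("gs://"):
            _size, timestamp, uri = parts
            entries.append((timestamp, uri))
    if not entries:
        raise ValueError("no objects found in listing")
    # The timestamps are ISO 8601 in UTC, so string order is chronological order.
    return max(entries)[1]


# Hypothetical listing, for illustration only.
SAMPLE_LISTING = """\
   1234567  2017-05-01T10:00:00Z  gs://example-bucket/pyhail-build-aaaaaaa.zip
   1234890  2017-06-15T08:30:00Z  gs://example-bucket/pyhail-build-bbbbbbb.zip
TOTAL: 2 objects, 2469457 bytes (2.35 MiB)
"""

print(latest_object(SAMPLE_LISTING))
# prints: gs://example-bucket/pyhail-build-bbbbbbb.zip
```

The ID taken from the zip name would then presumably be used for the matching jar as well, since the submit command references both files with the same ID.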


#9

Awesome! Is everything working?


#10

It is definitely running. The trial version is very limited in CPU, so execution is slow. I will keep you updated.
Thank you for your help.
I hope to be able to contribute to the project soon.
Thanks again!
eilalan