I am running the following command:
In [4]: hc.read('gs://data/gnomad.exomes.r2.0.1.sites.autosomes.vds').count()
and getting FatalError: IOException: No FileSystem for scheme: gs
Any idea why there is an issue? The error details are below.
thanks, eilalan
In [5]: hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.autosomes.vds').count()
/home/eila/hail/python/hail/java.pyc in handle_py4j(func, *args, **kwargs)
    111 raise FatalError('%s\n\nJava stack trace:\n%s\n'
    112                  'Hail version: %s\n'
--> 113                  'Error summary: %s' % (deepest, full, Env.hc().version, deepest))
    114 except py4j.protocol.Py4JError as e:
    115     if e.args[0].startswith('An error occurred while calling'):
FatalError: IOException: No FileSystem for scheme: gs
Java stack trace:
java.io.IOException: No FileSystem for scheme: gs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at is.hail.utils.richUtils.RichHadoopConfiguration$.fileSystem$extension(RichHadoopConfiguration.scala:17)
at is.hail.utils.richUtils.RichHadoopConfiguration$$anonfun$exists$extension$1.apply(RichHadoopConfiguration.scala:51)
at is.hail.utils.richUtils.RichHadoopConfiguration$$anonfun$exists$extension$1.apply(RichHadoopConfiguration.scala:51)
at scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
at scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)
at scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35)
at is.hail.utils.richUtils.RichHadoopConfiguration$.exists$extension(RichHadoopConfiguration.scala:51)
at is.hail.variant.VariantDataset$.readMetadata(VariantDataset.scala:103)
at is.hail.HailContext.readMetadata(HailContext.scala:394)
at is.hail.HailContext$$anonfun$readAllMetadata$1.apply(HailContext.scala:396)
at is.hail.HailContext$$anonfun$readAllMetadata$1.apply(HailContext.scala:396)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at is.hail.HailContext.readAllMetadata(HailContext.scala:396)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Hail version: devel-a0e653f
Error summary: IOException: No FileSystem for scheme: gs
It looks like you're not running on Google Dataproc. You need to be using Dataproc to have access to files in Google Storage buckets. Here's a helpful guide Laurent put together for this:
Running on the master machine through ssh is just like running on your laptop: it uses Spark local mode, which means you're not only unable to see Google buckets, but you're also not using any of the other machines in the cluster.
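For intuition about the error itself: Hadoop picks a FileSystem implementation by looking at the URI scheme, and in local mode nothing is registered for "gs". A minimal Python sketch of that dispatch (the registry and function here are illustrative, not Hail's or Hadoop's actual internals):

```python
from urllib.parse import urlparse

# Illustrative: in Spark local mode, the Google Cloud Storage connector is
# not on the classpath, so only schemes like "file" and "hdfs" resolve.
registered_schemes = {"file", "hdfs"}

def filesystem_for(path):
    # Hadoop extracts the scheme from the URI and looks up a FileSystem
    # class for it; an unknown scheme produces exactly this error.
    scheme = urlparse(path).scheme or "file"
    if scheme not in registered_schemes:
        raise IOError("No FileSystem for scheme: %s" % scheme)
    return scheme

print(filesystem_for("file:///tmp/data.vds"))
# filesystem_for("gs://bucket/data.vds") would raise, mirroring the error above
```

On a Dataproc cluster the GCS connector is pre-installed, which is why the same read works there.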
Instead, use:
gcloud dataproc jobs submit pyspark
There is an example of this in the post I linked above.
To clarify, you'll need to download the Google Cloud SDK to your computer to make this easier (if you haven't done that already and were starting the cluster from the UI). Then you can write a Python script that uses Hail and submit it to the cluster with gcloud dataproc jobs submit pyspark, using the format in the post above.
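The submission boils down to a single gcloud invocation. A small sketch that assembles the argument list (e.g., for subprocess.run); the jar and zip paths mirror the ones used later in this thread, and the cluster name is a placeholder:

```python
# Build the `gcloud dataproc jobs submit pyspark` invocation as an argv list.
JAR = "hail-hail-is-master-all-spark2.0.2-E4880e9.jar"

def submit_args(script, cluster="cluster-2"):
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "--cluster=" + cluster,
        # Ship the Hail jar and the Python package to the cluster:
        "--files=gs://hail-common/" + JAR,
        "--py-files=gs://hail-common/pyhail-hail-is-master-E4880e9.zip",
        # Both the driver and the executors need the jar on their classpath:
        "--properties=spark.driver.extraClassPath=./" + JAR
        + ",spark.executor.extraClassPath=./" + JAR,
        script,
    ]

print(" ".join(submit_args("hail_py.py")))
```

Double-check the exact jar/zip names under gs://hail-common for your Hail build; GCS object names are case-sensitive.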
Thank you for the clarification. I hope that I am getting closer to having it running…
Looking forward to the moment that I can make it work with all the amazing gnomAD data.
Should I copy the hail-common files to my bucket and point to them? See the error message below.
My steps were the following:
installed gcloud on my mac + connected to the project
created the following script, hail_py.py:
from hail import *
hc = HailContext()
print(hc)
vds = hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.autosomes.vds')
print(vds.count())
print('end')
called gcloud dataproc jobs submit pyspark:
wm8af-056:scripts landkof$ gcloud dataproc jobs submit pyspark --cluster=cluster-2 --files=gs://hail-common/hail-hail-is-master-all-spark2.0.2-E4880e9.jar --py-files=gs://hail-common/pyhail-hail-is-master-E4880e9.zip --properties="spark.driver.extraClassPath=./hail-hail-is-master-all-spark2.0.2-E4880e9.jar,spark.executor.extraClassPath=./hail-hail-is-master-all-spark2.0.2-E4880e9.jar" hail_py.py
Copying file://hail_py.py [Content-Type=text/x-python]…
\ [1 files][ 166.0 B/ 166.0 B]
Operation completed over 1 objects/166.0 B.
Job [36170c29-6b31-4ba7-a90b-c1322853c8d7] submitted.
The returned value and error message relate to the gs://hail-common files:
Waiting for job output…
=========== Cloud Dataproc Agent Error ===========
java.io.FileNotFoundException: File not found : gs://hail-common/pyhail-hail-is-master-E4880e9.zip
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1427)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2034)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2003)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.copyToLocalFile(GoogleHadoopFileSystemBase.java:2006)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1979)
at com.google.cloud.hadoop.services.agent.util.HadoopUtil.download(HadoopUtil.java:71)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.downloadResources(AbstractJobHandler.java:424)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:543)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:532)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
======== End of Cloud Dataproc Agent Error ========
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [36170c29-6b31-4ba7-a90b-c1322853c8d7] entered state [ERROR] while waiting for [DONE].
It is definitely running now. The trial version is very limited on CPU, so execution is slow. I will keep you updated.
Thank you for your help.
I hope to be able to contribute to the project soon.
Thanks again!
eilalan