IOException: No FileSystem for scheme: gs

While trying to run load_dataset on AWS EMR I see the error below. I am using pyspark and initializing Hail.

[hadoop@ip-172-31-101-148 ~]$ sudo pyspark
Python 3.6.12 (default, May 18 2021, 22:47:55)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/01 02:38:39 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
21/11/01 02:38:41 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 3.6.12 (default, May 18 2021 22:47:55)
SparkSession available as 'spark'.

import hail as hl
hl.init(sc)
/usr/local/lib/python3.6/site-packages/hail/backend/backend.py:130: UserWarning: pip-installed Hail requires additional configuration options in Spark referring
to the path to the Hail Python module directory HAIL_DIR,
e.g. /path/to/python/site-packages/hail:
spark.jars=HAIL_DIR/hail-all-spark.jar
spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
spark.executor.extraClassPath=./hail-all-spark.jar
'pip-installed Hail requires additional configuration options in Spark referring\n'
Running on Apache Spark version 2.4.4
SparkUI available at http://ip-172-31-101-148.ec2.internal:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.37-7952b436bd70
LOGGING: writing to /home/hadoop/hail-20211101-0240-0.2.37-7952b436bd70.log

mt = hl.experimental.load_dataset(name='dbSNP')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: load_dataset() missing 2 required positional arguments: 'version' and 'reference_genome'

mt = hl.experimental.load_dataset(name='dbSNP', version='154', reference_genome='GRCh38')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/hail/experimental/datasets.py", line 33, in load_dataset
    with hl.hadoop_open(config_file, 'r') as f:
  File "", line 2, in hadoop_open
  File "/usr/local/lib/python3.6/site-packages/hail/typecheck/check.py", line 585, in wrapper
    return original_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/hail/utils/hadoop_utils.py", line 79, in hadoop_open
    return Env.fs().open(path, mode, buffer_size)
  File "/usr/local/lib/python3.6/site-packages/hail/fs/hadoop_fs.py", line 12, in open
    handle = io.BufferedReader(HadoopReader(path, buffer_size), buffer_size=buffer_size)
  File "/usr/local/lib/python3.6/site-packages/hail/fs/hadoop_fs.py", line 45, in __init__
    self._jfile = Env.jutils().readFile(path, Env.backend()._jhc, buffer_size)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/local/lib/python3.6/site-packages/hail/utils/java.py", line 211, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: IOException: No FileSystem for scheme: gs

Any help is much appreciated.
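For anyone hitting this later: the exception is Hadoop reporting that no filesystem implementation is registered for the `gs://` scheme. In this Hail version, load_dataset resolves dataset paths against Google Cloud Storage, while a stock EMR cluster only registers handlers for schemes like `s3` and `hdfs`. A minimal sketch of the scheme dispatch that fails (the bucket path is illustrative, not Hail's real dataset config):

```python
from urllib.parse import urlparse

# Hypothetical dataset path of the shape load_dataset resolves to in this
# Hail version; the exact bucket/key are illustrative only.
path = "gs://hail-datasets/annotations/dbSNP.mt"

# Hadoop dispatches on the URL scheme; EMR ships handlers for these out of
# the box, but not for Google Cloud Storage ("gs").
registered_schemes = {"hdfs", "file", "s3", "s3a", "s3n"}

scheme = urlparse(path).scheme
print(scheme)                         # "gs"
print(scheme in registered_schemes)   # False -> "No FileSystem for scheme: gs"
```

So the fix is either to teach EMR's Hadoop about `gs://` (a GCS connector) or to point Hail at data it can reach natively, which is what the `region`/`cloud` arguments in newer releases do.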

I wanted to add that I also tried the command below, since the default is GCP:

mt = hl.experimental.load_dataset(name='dbSNP', version='154', reference_genome='GRCh38', region='us', cloud='aws')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: load_dataset() got an unexpected keyword argument 'region'
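As a side note, this TypeError is itself a version signal: a keyword argument the installed release doesn't recognize usually means the installed Hail predates it. A hedged sketch of comparing the banner version against an assumed cutoff (the cutoff release is an illustration, not a verified changelog fact):

```python
# Sketch: decide whether an installed Hail is new enough to accept the
# region=/cloud= keywords. The cutoff below is an assumption for
# illustration, not a verified changelog fact.
def version_tuple(v):
    # "0.2.37-7952b436bd70" -> (0, 2, 37): drop the build suffix, split on dots
    return tuple(int(p) for p in v.split("-")[0].split("."))

installed = "0.2.37-7952b436bd70"   # from the hl.init banner above
assumed_cutoff = "0.2.74"           # hypothetical first release with the kwargs

print(version_tuple(installed) >= version_tuple(assumed_cutoff))  # False
```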

You're using a very old version of Hail. If you update to the latest release, I think this should work.

I tried that, but I still see the same version. I'm thinking of doing a fresh installation unless you can catch my mistake:

python3.6 -m pip install hail --upgrade

hl.init(sc)
/usr/local/lib/python3.6/site-packages/hail/backend/backend.py:130: UserWarning: pip-installed Hail requires additional configuration options in Spark referring
to the path to the Hail Python module directory HAIL_DIR,
e.g. /path/to/python/site-packages/hail:
spark.jars=HAIL_DIR/hail-all-spark.jar
spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
spark.executor.extraClassPath=./hail-all-spark.jar
'pip-installed Hail requires additional configuration options in Spark referring\n'
Running on Apache Spark version 2.4.4
SparkUI available at http://ip-172-31-101-148.ec2.internal:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.37-7952b436bd70
LOGGING: writing to /home/hadoop/hail-20211101-1814-0.2.37-7952b436bd70.log
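One thing worth checking when the banner still shows 0.2.37 after an upgrade is import precedence: the hl.init warning points at /usr/local/lib/python3.6/site-packages, while pip run as the hadoop user installs under /home/hadoop/.local, and Python imports whichever copy appears first on sys.path (running under `sudo` also changes which user's site directory is searched). A small sketch for checking which copy would actually win; "json" here is just a stand-in, on the cluster you would pass "hail":

```python
# Sketch: see which copy of a module the interpreter will actually import.
# When two installs exist (system site-packages vs ~/.local), the first
# matching entry on sys.path wins; running under `sudo` changes HOME, so a
# user-site install can silently be skipped in favor of the system copy.
import importlib.util
import sys

def resolve(package):
    """Return the file a package would be imported from, or None if absent."""
    spec = importlib.util.find_spec(package)
    return spec.origin if spec else None

print(sys.executable)   # which interpreter is running (sudo may change this)
print(resolve("json"))  # stdlib example; substitute "hail" on the cluster
```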

Can you do:

pip show hail

and

python3.6 -c "import hail as hl; print(hl.__file__)"

Thanks @tpoterba

[hadoop@ip-172-31-101-148 ~]$ pip show hail

Name: hail
Version: 0.2.74
Summary: Scalable library for exploring and analyzing genomic data.
Home-page: https://hail.is
Author: Hail Team
Author-email: hail@broadinstitute.org
License: UNKNOWN
Location: /home/hadoop/.local/lib/python3.6/site-packages
Requires: aiohttp, aiohttp-session, asyncinit, bokeh, boto3, botocore, decorator, Deprecated, dill, fsspec, gcsfs, google-cloud-storage, humanize, hurry.filesize, janus, nest-asyncio, numpy, pandas, parsimonious, PyJWT, pyspark, python-json-logger, requests, scipy, tabulate, tqdm
Required-by:
[hadoop@ip-172-31-101-148 ~]$ python3.6 -c "import hail as hl; print(hl.__file__)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/hadoop/.local/lib/python3.6/site-packages/hail/__init__.py", line 44, in <module>
    from .table import Table, GroupedTable, asc, desc  # noqa: E402
  File "/home/hadoop/.local/lib/python3.6/site-packages/hail/table.py", line 3, in <module>
    import pandas
  File "/home/hadoop/.local/lib/python3.6/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat.numpy import (
  File "/home/hadoop/.local/lib/python3.6/site-packages/pandas/compat/numpy/__init__.py", line 21, in <module>
    "this version of pandas is incompatible with numpy < 1.15.4\n"
ImportError: this version of pandas is incompatible with numpy < 1.15.4
your numpy version is 1.14.5.
Please upgrade numpy to >= 1.15.4 to use this pandas version
[hadoop@ip-172-31-101-148 ~]$
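For what it's worth, this last failure is independent of Hail: the pandas pulled in by the upgrade refuses to import against numpy 1.14.5. The check is a plain minimum-version comparison at import time; a simplified stand-in for illustration (the real check lives in pandas.compat.numpy):

```python
# Simplified stand-in for the minimum-version check pandas performs at
# import time; the real one lives in pandas.compat.numpy.
def parse_version(v):
    return tuple(int(part) for part in v.split("."))

installed_numpy = "1.14.5"   # the version reported in the traceback
required_numpy = "1.15.4"    # pandas' stated minimum

if parse_version(installed_numpy) < parse_version(required_numpy):
    # In pandas this raises ImportError; upgrading numpy
    # (e.g. `pip install --upgrade numpy`) clears it.
    print("incompatible: upgrade numpy to >=", required_numpy)
```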

Hey @rahulch ,

How are you installing Hail? I don't expect Hail-from-pip to work properly on EMR. I believe you have three options:

  • Some folks at Harvard Medical School maintain scripts for running Hail on Amazon.
  • You could also install Hail from source on the master node of the spark cluster.
  • We maintain a tool, hailctl (included in the Hail Python package), for running Hail on Google Dataproc, if you're able to use Google Cloud instead.