IOException: No FileSystem for scheme: gs

While trying to run load_dataset on AWS EMR I see the error below. I am using pyspark and initializing Hail.

[hadoop@ip-172-31-101-148 ~]$ sudo pyspark
Python 3.6.12 (default, May 18 2021, 22:47:55)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/01 02:38:39 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
21/11/01 02:38:41 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 3.6.12 (default, May 18 2021 22:47:55)
SparkSession available as 'spark'.

import hail as hl
hl.init(sc)
/usr/local/lib/python3.6/site-packages/hail/backend/backend.py:130: UserWarning: pip-installed Hail requires additional configuration options in Spark referring
to the path to the Hail Python module directory HAIL_DIR,
e.g. /path/to/python/site-packages/hail:
spark.jars=HAIL_DIR/hail-all-spark.jar
spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
spark.executor.extraClassPath=./hail-all-spark.jar
'pip-installed Hail requires additional configuration options in Spark referring\n'
Running on Apache Spark version 2.4.4
SparkUI available at http://ip-172-31-101-148.ec2.internal:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.37-7952b436bd70
LOGGING: writing to /home/hadoop/hail-20211101-0240-0.2.37-7952b436bd70.log

mt = hl.experimental.load_dataset(name='dbSNP')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: load_dataset() missing 2 required positional arguments: 'version' and 'reference_genome'

mt = hl.experimental.load_dataset(name='dbSNP', version='154', reference_genome='GRCh38')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/hail/experimental/datasets.py", line 33, in load_dataset
    with hl.hadoop_open(config_file, 'r') as f:
  File "", line 2, in hadoop_open
  File "/usr/local/lib/python3.6/site-packages/hail/typecheck/check.py", line 585, in wrapper
    return original_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/hail/utils/hadoop_utils.py", line 79, in hadoop_open
    return Env.fs().open(path, mode, buffer_size)
  File "/usr/local/lib/python3.6/site-packages/hail/fs/hadoop_fs.py", line 12, in open
    handle = io.BufferedReader(HadoopReader(path, buffer_size), buffer_size=buffer_size)
  File "/usr/local/lib/python3.6/site-packages/hail/fs/hadoop_fs.py", line 45, in __init__
    self._jfile = Env.jutils().readFile(path, Env.backend()._jhc, buffer_size)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/local/lib/python3.6/site-packages/hail/utils/java.py", line 211, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: IOException: No FileSystem for scheme: gs

Any help is much appreciated.
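For anyone hitting this later: the exception is Hadoop reporting that no filesystem implementation is registered for the `gs://` scheme. In this Hail version, load_dataset resolves dataset paths against Google Cloud Storage, while a stock EMR cluster only registers handlers for schemes like `s3` and `hdfs`. A minimal sketch of the scheme dispatch that fails (the bucket path is illustrative, not Hail's real dataset config):

```python
from urllib.parse import urlparse

# Hypothetical dataset path of the shape load_dataset resolves to in this
# Hail version; the exact bucket/key are illustrative only.
path = "gs://hail-datasets/annotations/dbSNP.mt"

# Hadoop dispatches on the URL scheme; EMR ships handlers for these out of
# the box, but not for Google Cloud Storage ("gs").
registered_schemes = {"hdfs", "file", "s3", "s3a", "s3n"}

scheme = urlparse(path).scheme
print(scheme)                         # "gs"
print(scheme in registered_schemes)   # False -> "No FileSystem for scheme: gs"
```

So the fix is either to teach EMR's Hadoop about `gs://` (a GCS connector) or to point Hail at data it can reach natively, which is what the `region`/`cloud` arguments in newer releases do.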

I wanted to add that I also tried the command below, since the default is GCP:

mt = hl.experimental.load_dataset(name='dbSNP', version='154', reference_genome='GRCh38', region='us', cloud='aws')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: load_dataset() got an unexpected keyword argument 'region'
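As a side note, this TypeError is itself a version signal: a keyword argument the installed release doesn't recognize usually means the installed Hail predates it. A hedged sketch of comparing the banner version against an assumed cutoff (the cutoff release is an illustration, not a verified changelog fact):

```python
# Sketch: decide whether an installed Hail is new enough to accept the
# region=/cloud= keywords. The cutoff below is an assumption for
# illustration, not a verified changelog fact.
def version_tuple(v):
    # "0.2.37-7952b436bd70" -> (0, 2, 37): drop the build suffix, split on dots
    return tuple(int(p) for p in v.split("-")[0].split("."))

installed = "0.2.37-7952b436bd70"   # from the hl.init banner above
assumed_cutoff = "0.2.74"           # hypothetical first release with the kwargs

print(version_tuple(installed) >= version_tuple(assumed_cutoff))  # False
```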

You're using a very old version of Hail. If you update to the latest release, I think this should work.

I tried that, but I still see the same version. I'm thinking of doing a fresh installation unless you can catch my mistake:

python3.6 -m pip install hail --upgrade

hl.init(sc)
/usr/local/lib/python3.6/site-packages/hail/backend/backend.py:130: UserWarning: pip-installed Hail requires additional configuration options in Spark referring
to the path to the Hail Python module directory HAIL_DIR,
e.g. /path/to/python/site-packages/hail:
spark.jars=HAIL_DIR/hail-all-spark.jar
spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
spark.executor.extraClassPath=./hail-all-spark.jar
'pip-installed Hail requires additional configuration options in Spark referring\n'
Running on Apache Spark version 2.4.4
SparkUI available at http://ip-172-31-101-148.ec2.internal:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.37-7952b436bd70
LOGGING: writing to /home/hadoop/hail-20211101-1814-0.2.37-7952b436bd70.log
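One thing worth checking when the banner still shows 0.2.37 after an upgrade is import precedence: the hl.init warning points at /usr/local/lib/python3.6/site-packages, while pip run as the hadoop user installs under /home/hadoop/.local, and Python imports whichever copy appears first on sys.path (running under `sudo` also changes which user's site directory is searched). A small sketch for checking which copy would actually win; "json" here is just a stand-in, on the cluster you would pass "hail":

```python
# Sketch: see which copy of a module the interpreter will actually import.
# When two installs exist (system site-packages vs ~/.local), the first
# matching entry on sys.path wins; running under `sudo` changes HOME, so a
# user-site install can silently be skipped in favor of the system copy.
import importlib.util
import sys

def resolve(package):
    """Return the file a package would be imported from, or None if absent."""
    spec = importlib.util.find_spec(package)
    return spec.origin if spec else None

print(sys.executable)   # which interpreter is running (sudo may change this)
print(resolve("json"))  # stdlib example; substitute "hail" on the cluster
```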

Can you do:

pip show hail

and

python3.6 -c "import hail as hl; print(hl.__file__)"

Thanks @tpoterba

[hadoop@ip-172-31-101-148 ~]$ pip show hail

Name: hail
Version: 0.2.74
Summary: Scalable library for exploring and analyzing genomic data.
Home-page: https://hail.is
Author: Hail Team
Author-email: hail@broadinstitute.org
License: UNKNOWN
Location: /home/hadoop/.local/lib/python3.6/site-packages
Requires: aiohttp, aiohttp-session, asyncinit, bokeh, boto3, botocore, decorator, Deprecated, dill, fsspec, gcsfs, google-cloud-storage, humanize, hurry.filesize, janus, nest-asyncio, numpy, pandas, parsimonious, PyJWT, pyspark, python-json-logger, requests, scipy, tabulate, tqdm
Required-by:
[hadoop@ip-172-31-101-148 ~]$ python3.6 -c "import hail as hl; print(hl.__file__)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/hadoop/.local/lib/python3.6/site-packages/hail/__init__.py", line 44, in <module>
    from .table import Table, GroupedTable, asc, desc  # noqa: E402
  File "/home/hadoop/.local/lib/python3.6/site-packages/hail/table.py", line 3, in <module>
    import pandas
  File "/home/hadoop/.local/lib/python3.6/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat.numpy import (
  File "/home/hadoop/.local/lib/python3.6/site-packages/pandas/compat/numpy/__init__.py", line 21, in <module>
    "this version of pandas is incompatible with numpy < 1.15.4\n"
ImportError: this version of pandas is incompatible with numpy < 1.15.4
your numpy version is 1.14.5.
Please upgrade numpy to >= 1.15.4 to use this pandas version
[hadoop@ip-172-31-101-148 ~]$
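For what it's worth, this last failure is independent of Hail: the pandas pulled in by the upgrade refuses to import against numpy 1.14.5. The check is a plain minimum-version comparison at import time; a simplified stand-in for illustration (the real check lives in pandas.compat.numpy):

```python
# Simplified stand-in for the minimum-version check pandas performs at
# import time; the real one lives in pandas.compat.numpy.
def parse_version(v):
    return tuple(int(part) for part in v.split("."))

installed_numpy = "1.14.5"   # the version reported in the traceback
required_numpy = "1.15.4"    # pandas' stated minimum

if parse_version(installed_numpy) < parse_version(required_numpy):
    # In pandas this raises ImportError; upgrading numpy
    # (e.g. `pip install --upgrade numpy`) clears it.
    print("incompatible: upgrade numpy to >=", required_numpy)
```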

Hey @rahulch ,

How are you installing Hail? I don't expect Hail-from-pip to work properly on EMR. I believe you have three options:

  • Some folks at Harvard Medical School maintain scripts for running Hail on Amazon.
  • You could also install Hail from source on the master node of the spark cluster.
  • We maintain a tool, hailctl (included in the Hail Python package), for running Hail on Google Dataproc, if you're able to use Google Cloud instead.