No Filesystem for scheme "s3" with import_vcf

This question is a continuation of
https://discuss.hail.is/t/error-summary-unsupportedfilesystemexception-no-filesystem-for-scheme-s3/2094
I’m trying to read a VCF from s3 using Hail, and I keep getting a “No FileSystem for scheme s3” error.

>>> import hail as hl
>>> hl.init(log='/home/hadoop/hailstuff.log')
Running on Apache Spark version 3.1.3
SparkUI available at http://ip-10-130-27-85.columbuschildrens.net:4041
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.100-2ea2615a797a
LOGGING: writing to /home/hadoop/hailcat.log
>>> mt=hl.import_vcf('s3://my-bucket/my.vcf',reference_genome='GRCh38')
..
Hail version: 0.2.100-2ea2615a797a
Error summary: UnsupportedFileSystemException: No FileSystem for scheme "s3"

If I replace “s3” with “s3a”, I get a permissions error instead, regardless of whether the file I’m requesting actually exists.

I followed the directions in this post
https://discuss.hail.is/t/error-summary-unsupportedfilesystemexception-no-filesystem-for-scheme-s3/2094
but to no avail. Running the s3 connector script from https://gist.github.com/danking/f8387f5681b03edc5babdf36e14140bc unfortunately didn’t change anything, even after I opened spark-defaults.conf and removed everything from spark.hadoop.fs.s3a.aws.credentials.provider except org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider.
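
In case it helps to see what I’m attempting end to end, here is a minimal sketch of the call sequence. The bucket path is a placeholder, spark_conf is just Hail’s way of passing Spark properties at init, and the fs.s3.impl mapping is my guess at what the connector script is supposed to wire up (the credentials-provider value is the same one I tried in spark-defaults.conf):

import hail as hl

# Sketch only: 'my-bucket/my.vcf' is a placeholder for my real file.
hl.init(
    log='/home/hadoop/hailstuff.log',
    spark_conf={
        # My guess: map the bare s3:// scheme onto the S3A filesystem.
        'spark.hadoop.fs.s3.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem',
        # Same credentials-provider setting I tried in spark-defaults.conf.
        'spark.hadoop.fs.s3a.aws.credentials.provider':
            'org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider',
    },
)
mt = hl.import_vcf('s3://my-bucket/my.vcf', reference_genome='GRCh38')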

Maybe of note: I install my Python and Hail from scratch from the EMR command line using the following command sequence:

sudo rm -rf /usr/local/lib/python3.7/site-packages/numpy
sudo rm -rf /usr/local/lib64/python3.7/site-packages/numpy
sudo yum install python3-devel -y
python3 -m venv curenv
source ~/curenv/bin/activate
pip3 install pandas sklearn scipy smart-open[s3] matplotlib seaborn
pip3 install hail

So I run Hail within the virtual environment curenv, and my SPARK_HOME variable is /home/hadoop/curenv/lib64/python3.7/site-packages/pyspark. I have to manually create an empty conf/spark-defaults.conf file in that directory before I can run the s3 connector script.
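
For reference, this is the sanity check I use to confirm which PySpark installation SPARK_HOME should point at (nothing EMR-specific here, just inspecting the pip-installed package inside the virtualenv):

import os
import pyspark

# The pip-installed PySpark inside the virtualenv: this is what SPARK_HOME
# points at, and where the connector script expects conf/spark-defaults.conf.
spark_home = os.path.dirname(pyspark.__file__)
print(spark_home)
print(os.path.join(spark_home, 'conf', 'spark-defaults.conf'))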

I know this question has been addressed, but reading from s3 is a crucial Hail task and I’d really love to have this ability. Thanks for any help you can offer.

Hey @jbgaither !

Working with Hadoop filesystems is really a pain! I’m sorry you’re experiencing it firsthand.

First, I have a question: what environment are you in? It sounds like you might be using Amazon EMR. I would expect Amazon EMR to already be configured correctly for the s3 protocol (Amazon claims EMR supports s3://). If you just start a new cluster and don’t touch any configuration files or install anything, it should work. Does it not? What happens if you SSH to the leader node and run the following command?

hadoop fs -ls s3://YOUR_BUCKET_HERE/

Ah, perhaps the issue is the way you’re installing Hail. The pip version of Hail depends on pyspark. The PyPI PySpark package is not compatible with EMR’s Spark installation. It’s probably sufficient to pip3 uninstall pyspark, but we generally try to avoid installing it in the first place. We provide make -C hail install-on-cluster for this purpose. You might also try Amazon’s published documentation about using Hail on EMR.
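
If it helps narrow things down, a quick way to check for that conflict is to see which PySpark your virtualenv’s Python actually resolves (a sketch; /usr/lib/spark is where EMR normally installs its own Spark):

import pyspark

# If this path points inside the virtualenv's site-packages rather than
# EMR's Spark installation (normally /usr/lib/spark), the PyPI package is
# the one Hail is actually running against.
print(pyspark.__version__)
print(pyspark.__file__)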

Hi @danking! Thanks so much for your response. I am indeed using EMR, which could be the source of my problems. Your suggested command

hadoop fs -ls s3://YOUR_BUCKET_HERE

works fine. I can run s3 operations from the command line; I only run into problems when I try to read from s3 with Hail.
You may be right that installing Hail on a fresh cluster would solve the problem, or uninstalling pyspark, or going the make -C hail install-on-cluster route, or reading the EMR Hail docs at https://aws.amazon.com/quickstart/architecture/hail/ more closely. I will explore these avenues and let you know what works.


Got this to work by following the cluster-specific instructions at https://hail.is/docs/0.2/install/other-cluster.html supplied by @danking. There were a couple of steps I had to modify:

Since liblz4-dev apparently conflicts with some EMR packages, before the git step I run

sudo yum install lz4-devel

To fix installation errors relating to not being able to find “jni.h”, I run

find /usr/lib -name jni.h   # to identify the Java home path
export JAVA_HOME=/usr/lib/jvm/java-11-amazon-corretto.x86_64

Then run the install command with EMR’s Spark and Scala versions:

sudo make install-on-cluster HAIL_COMPILE_NATIVES=1 \
    SPARK_VERSION=3.1.1 \
    SCALA_VERSION=2.12.10

To get pyspark to play nicely with hail,

sudo pip3 install pyspark==3.1.1

Finally, run the s3 connector script. And we’re good to go: my Hail can now read VCFs from s3.
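
For anyone who lands here later, this is the quick end-to-end check I use to confirm the fix (the bucket path is a placeholder for my own file):

import hail as hl

# After the install-on-cluster build and the s3 connector script, the
# original s3:// read goes through; the path is a placeholder for my VCF.
hl.init(log='/home/hadoop/hailstuff.log')
mt = hl.import_vcf('s3://my-bucket/my.vcf', reference_genome='GRCh38')
mt.count()  # forces a read, so the s3 scheme really is being exercised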
