Hi all,
I am currently developing a Python package for GWAS pipelines using Hail. It works fine when I run it in Spark cluster mode. However, when I run Hail in Spark local mode, it throws the following error.
Is it not possible to read Hail matrix tables in Spark local mode?
Stack trace
Traceback (most recent call last):
File "tests/test.py", line 118, in test_use_cohort_from_source
self.assertTrue(expr=ct.output().mt.exists(), msg=f'{ct.output().mt} does not exist!')
File "/home/users/ab904123/piranha_package/piranha/piranha/workflows/target.py", line 223, in exists
return self.spark_state == 'complete'
File "/home/users/ab904123/piranha_package/piranha/piranha/workflows/target.py", line 215, in spark_state
if self.fs.exists(self.completeness_file):
File "/home/users/ab904123/piranha_package/piranha/piranha/bmrn_luigi_ext/config.py", line 186, in wrapped
return fn(*args, **kwargs)
File "/home/users/ab904123/piranha_package/piranha/piranha/workflows/filesystem.py", line 17, in exists
return hl.hadoop_exists(path)
File "/bmrn/apps/hail/0.2.42/python/hail-0.2.42-py3-none-any.egg/hail/utils/hadoop_utils.py", line 128, in hadoop_exists
return Env.fs().exists(path)
File "/bmrn/apps/hail/0.2.42/python/hail-0.2.42-py3-none-any.egg/hail/fs/hadoop_fs.py", line 30, in exists
return self._jfs.exists(path)
File "/bmrn/apps/spark/2.4.5/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/bmrn/apps/hail/0.2.42/python/hail-0.2.42-py3-none-any.egg/hail/backend/spark_backend.py", line 41, in deco
'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: IllegalArgumentException: null
Java stack trace:
java.lang.IllegalArgumentException: null
at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1314)
at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1237)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:280)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:148)
at is.hail.io.fs.FS$class.exists(FS.scala:114)
at is.hail.io.fs.HadoopFS.exists(HadoopFS.scala:57)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)