Hi,
I’m attempting to set up Hail to read matrix tables from and write them to MinIO (S3-compatible) storage, but so far I’ve only been able to read them.
I’m using Spark 3.3.2. I have installed both hadoop-aws-3.3.2.jar and aws-java-sdk-bundle-1.11.1026.jar, and I have configured spark-defaults.conf as follows:
# generic
spark.serializer=org.apache.spark.serializer.KryoSerializer
# hail
spark.jars=/home/ubuntu/venv/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar
spark.driver.extraClassPath=/home/ubuntu/venv/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar
spark.executor.extraClassPath=./hail-all-spark.jar
spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator
spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,com.amazonaws.auth.profile.ProfileCredentialsProvider,org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
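(For completeness: I installed the two jars manually, but I believe an equivalent alternative, which I have not tested here, would be to let Spark resolve hadoop-aws and its matching aws-java-sdk-bundle from Maven via spark.jars.packages, e.g.
# alternative, not what I'm doing here: resolve hadoop-aws (and transitively aws-java-sdk-bundle) from Maven
spark.jars.packages=org.apache.hadoop:hadoop-aws:3.3.2
)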
I’m starting Hail as follows:
import hail as hl
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
spark_conf = SparkConf().setAppName("hail-test")
spark_conf.set("spark.hadoop.fs.s3a.endpoint", "http://some_minio_host:9000/")
spark_conf.set("spark.hadoop.fs.s3a.access.key", "some_user")
spark_conf.set("spark.hadoop.fs.s3a.secret.key", "some_password" )
spark_conf.set("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
spark_conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark_conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc = SparkContext(conf=spark_conf)
hl.init(sc=sc)
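As I understand it, the same settings could also be passed directly through hl.init’s spark_conf argument instead of building the SparkContext by hand; a minimal sketch with the same placeholder endpoint and credentials (above I’m still using the explicit SparkContext):
import hail as hl

# sketch: let hl.init create the SparkContext and apply the s3a settings itself
hl.init(
    app_name="hail-test",
    spark_conf={
        "spark.hadoop.fs.s3a.endpoint": "http://some_minio_host:9000/",
        "spark.hadoop.fs.s3a.access.key": "some_user",
        "spark.hadoop.fs.s3a.secret.key": "some_password",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    },
)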
With this configuration I am able to read matrix tables correctly, e.g.
# test read
mt = hl.read_matrix_table("s3a://data-hail/some_matrix_table.mt")
mt.rows().select().show(5)
but I’m not able to write matrix tables back to MinIO, as
# test write, s3a scheme
mt.write("s3a://data-hail/new_matrix_table.mt")
raises this error:
ValueError: no file system found for url s3a://data-hail/new_matrix_table.mt
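One thing I haven’t tried yet is checking whether Hail’s Python-side Hadoop helpers can see the bucket at all, e.g. (hypothetical check on my side):
# can Hail's own Hadoop layer list the bucket?
hl.hadoop_ls("s3a://data-hail/")
which might help narrow down whether the problem is specific to writing.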
If I change the URI scheme to s3:
# test write, s3 scheme
mt.write("s3://data-hail/new_matrix_table.mt")
I get a different error:
Hail version: 0.2.128-eead8100a1c1
Error summary: UnsupportedFileSystemException: No FileSystem for scheme "s3"
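I also wondered whether explicitly mapping the bare s3 scheme to the S3A implementation would help, along these lines (untested on my side, and I’m not sure whether Hail’s own file-system layer even consults it):
# hypothetical: map the plain "s3" scheme to S3AFileSystem so Hadoop can resolve it
spark_conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")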
Is there a way to resolve this so that matrix tables can be written to MinIO?
Thanks!