Unable to write matrix tables to MinIO S3 storage

Hi,

I’m attempting to set up Hail to both read matrix tables from and write them to MinIO (S3-compatible) storage, but so far I’ve only been successful in reading them.

I’m using Spark 3.3.2. I have installed both hadoop-aws-3.3.2.jar and aws-java-sdk-bundle-1.11.1026.jar and I have configured spark-defaults.conf as

# generic
spark.serializer=org.apache.spark.serializer.KryoSerializer

# hail
spark.jars=/home/ubuntu/venv/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar
spark.driver.extraClassPath=/home/ubuntu/venv/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar
spark.executor.extraClassPath=./hail-all-spark.jar
spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator
spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,com.amazonaws.auth.profile.ProfileCredentialsProvider,org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider

I’m starting Hail as

import hail as hl
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

spark_conf = SparkConf().setAppName("hail-test")
spark_conf.set("spark.hadoop.fs.s3a.endpoint", "http://some_minio_host:9000/")
spark_conf.set("spark.hadoop.fs.s3a.access.key", "some_user")
spark_conf.set("spark.hadoop.fs.s3a.secret.key", "some_password")
spark_conf.set("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
spark_conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark_conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc = SparkContext(conf=spark_conf)
hl.init(sc=sc)

With this configuration, I am able to read matrix tables correctly, via, e.g.,

# test read
mt = hl.read_matrix_table("s3a://data-hail/some_matrix_table.mt")
mt.rows().select().show(5)

but I’m not able to write matrix tables to the MinIO storage, as

# test write, s3a scheme
mt.write("s3a://data-hail/new_matrix_table.mt")

raises this error:

ValueError: no file system found for url s3a://data-hail/new_matrix_table.mt

If I change the URI scheme to s3:

# test write, s3 scheme
mt.write("s3://data-hail/new_matrix_table.mt")

I get a different error:

Hail version: 0.2.128-eead8100a1c1
Error summary: UnsupportedFileSystemException: No FileSystem for scheme "s3"

Is it possible to resolve this in order to write matrix tables to MinIO?

Thanks!

Update: it seems to work correctly with the s3 URI scheme if I additionally set this in the Spark configuration:

spark_conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
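For anyone hitting the same issue, here is the consolidated initialization that works in my setup, as a sketch (the endpoint, bucket, and credentials are placeholders). The key line is the fs.s3.impl mapping, which tells Hadoop to serve the bare s3 scheme with the S3A filesystem so the write path finds an implementation:

```python
import hail as hl
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

# Endpoint and credentials below are placeholders for your MinIO deployment.
spark_conf = SparkConf().setAppName("hail-test")
spark_conf.set("spark.hadoop.fs.s3a.endpoint", "http://some_minio_host:9000/")
spark_conf.set("spark.hadoop.fs.s3a.access.key", "some_user")
spark_conf.set("spark.hadoop.fs.s3a.secret.key", "some_password")
spark_conf.set("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
spark_conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark_conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
# Map the bare "s3" scheme to S3A as well; without this, writing raises
# UnsupportedFileSystemException: No FileSystem for scheme "s3".
spark_conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

sc = SparkContext(conf=spark_conf)
hl.init(sc=sc)

# Reads work with either scheme; writes work with s3:// once fs.s3.impl is set.
mt = hl.read_matrix_table("s3://data-hail/some_matrix_table.mt")
mt.write("s3://data-hail/new_matrix_table.mt", overwrite=True)
```

This configuration sketch needs a running Spark cluster with the hadoop-aws and aws-java-sdk-bundle jars on the classpath, plus a reachable MinIO endpoint.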