Error processing VCF: no file system found for url s3a://demo-test-868/output/out.vcf.bgz

I am getting this error while writing the filtered data to an S3 bucket. I have made sure that all the required jars (hadoop-aws, aws-java-sdk) are in place; they are pulled in via spark.jars.packages in the config below.

Any help would be greatly appreciated.

Here is my code.

import hail as hl
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def main():
    hl.init(
        log="/tmp/hail.log",
        spark_conf={
            "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
            "spark.hadoop.fs.AbstractFileSystem.s3a.impl": "org.apache.hadoop.fs.s3a.S3A",
            # "spark.hadoop.fs.FSDataOutputStream.s3a": "org.apache.hadoop.fs.s3a.S3AOutputStream",
            "spark.hadoop.fs.s3a.outputstream": "org.apache.hadoop.fs.s3a.S3AOutputStream",
            "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
            "spark.hadoop.fs.s3a.access.key": "",  # removed for security reasons
            "spark.hadoop.fs.s3a.secret.key": "",  # removed for security reasons
            "spark.hadoop.fs.s3a.endpoint": "s3.amazonaws.com",
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.2.2,"
                                   "com.amazonaws:aws-java-sdk:1.12.180,"
                                   "io.delta:delta-core_2.12:1.1.0",
            "spark.hadoop.fs.s3a.connection.maximum": "100",
            "spark.hadoop.fs.s3a.fast.upload.active.blocks": "1",
            "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
            "spark.hadoop.fs.s3a.path.style.access": "true",
            "spark.hadoop.fs.s3a.multipart.size": "104857600",
            "spark.hadoop.fs.s3a.fast.upload": "true",
            "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
            "spark.speculation": "false",
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        },
    )

    # File paths
    vcf_path = "s3a://demo-test-868/data.vcf.gz"  # Original file path
    output_path = "s3a://demo-test-868/output/out.vcf.bgz"  # Output path

    try:
        logger.info("Reading VCF file...")
        # If the file is block-gzipped, use force_bgz=True; if not, pre-convert it to BGZF format.
        mt = hl.import_vcf(
            vcf_path,
            force_bgz=True,  # change to force=True for non-BGZF .gz files
            reference_genome="GRCh38",
            array_elements_required=False,
        )
        logger.info("Successfully imported VCF.")

        logger.info("Processing data...")
        mt.describe()  # Example operation

        logger.info("Writing processed VCF to output...")
        mt.write(output_path, overwrite=True)
        logger.info("Process completed successfully.")

    except Exception as e:
        logger.error(f"Error processing VCF: {e}")

if __name__ == "__main__":
    main()

And I am getting this error while writing the processed VCF to the output path:
Row key: ['locus', 'alleles']

INFO:__main__:Writing processed VCF to output...
ERROR:__main__:Error processing VCF: no file system found for url s3a://emr-demo-test-868/output/out.vcf.bgz
INFO:py4j.clientserver:Closing down clientserver connection
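
For clarity, stripped of the logging and the describe() call, the script boils down to the snippet below (same spark_conf as in the full script above, abbreviated here; bucket and paths unchanged). Per the log, the import stage completes and the error is raised at the write call.

import hail as hl

# Same s3a-related spark_conf as in the full script above, abbreviated here.
hl.init(
    log="/tmp/hail.log",
    spark_conf={
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.2.2,"
                               "com.amazonaws:aws-java-sdk:1.12.180,"
                               "io.delta:delta-core_2.12:1.1.0",
        # ... remaining s3a keys as in the full config above ...
    },
)

mt = hl.import_vcf(
    "s3a://demo-test-868/data.vcf.gz",
    force_bgz=True,
    reference_genome="GRCh38",
    array_elements_required=False,
)

# This is the call that fails with "no file system found for url s3a://..."
mt.write("s3a://demo-test-868/output/out.vcf.bgz", overwrite=True)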