[ERROR] An error occurred: HailException: RelationalSetup.writeMetadata: file already exists: output.filtered_data

Here is my code:

import hail as hl
import logging

# Configure logger
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

def main():
    logger.info("Initializing Hail with Spark...")
    hl.init(
        app_name="VCF_DATA Processing",
        log="/tmp/hail.log",
        spark_conf={
            # Explicitly set JAR files for AWS S3 access
            "spark.jars": "/home/ubuntu/miniconda3/envs/hail/lib/python3.9/site-packages/pyspark/jars/hadoop-aws-3.2.0.jar,"
                          "/home/ubuntu/miniconda3/envs/hail/lib/python3.9/site-packages/pyspark/jars/aws-java-sdk-bundle-1.11.375.jar,"
                          "/home/ubuntu/miniconda3/envs/hail/lib/python3.9/site-packages/pyspark/jars/spark-hadoop-cloud_2.12-3.5.0.jar",
            # Use the default credentials provider chain for AWS S3 access
            "spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
            "spark.hadoop.fs.s3a.endpoint": "s3.amazonaws.com",
            "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
            "spark.hadoop.fs.s3a.connection.maximum": "1000",
            "spark.hadoop.fs.s3a.attempts.maximum": "10",
            "spark.hadoop.fs.s3a.retry.interval": "100ms",
            "spark.hadoop.fs.s3a.connection.timeout": "5000",
            "spark.hadoop.fs.s3a.connection.establish.timeout": "5000",
            "spark.hadoop.fs.s3a.threads.max": "10",
            "spark.hadoop.fs.s3a.connection.ssl.enabled": "true",
        },
    )

input_path = "s3a://emr-demo-test-868/dragen.vcf.gz"
output_path = "s3a://emr-demo-test-868/output/out.vcf.bgz"

try:
    logger.info("Reading VCF file from S3...")
    vcf_data = hl.import_vcf(input_path, force_bgz=True, reference_genome='GRCh38', skip_invalid_loci=True)
    logger.info("Displaying VCF file from S3...")
    vcf_data.show(5)
    logger.info("Filtering Data...")
    filtered_data = vcf_data.filter_rows(vcf_data.info.IC > 1.0)

    logger.info("Writing output to S3...")
    filtered_data.write(output_path, overwrite=True)


    logger.info("Performing sample QC...")
    vcf_data = hl.sample_qc(vcf_data)

    logger.info("Processing complete.")
except Exception as e:
    logger.error(f"An error occurred: {e}")
finally:
    hl.stop()

if __name__ == "__main__":
    main()
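(A side note on the write call itself: as far as I understand, filtered_data.write(...) saves Hail's native MatrixTable format even though my output path ends in .vcf.bgz. If I actually want a VCF file out, I believe the call would be hl.export_vcf instead, roughly:

    # Hypothetical alternative to filtered_data.write(): export an actual
    # block-gzipped VCF to the same path instead of a native MatrixTable.
    hl.export_vcf(filtered_data, output_path)

But the filesystem error below happens before that distinction matters.)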

And I am getting a filesystem error while writing the output to S3:

showing the first 0 of 54 columns
2025-01-18 09:09:46,153 [INFO] Filtering Data...
2025-01-18 09:09:46,158 [INFO] Writing output to S3...
2025-01-18 09:09:46,158 [ERROR] An error occurred: no file system found for url s3a://emr-demo/output/out.vcf.bgz
2025-01-18 09:09:56,563 [INFO] Closing down clientserver connection
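For what it's worth, to isolate the problem I put together a minimal check of whether Hail's filesystem layer can resolve s3a:// paths at all (a sketch reusing the same hl.init configuration as above; hl.hadoop_exists should go through the same filesystem code as write, if I understand correctly):

    # After hl.init(...) as above: if the s3a filesystem is registered,
    # this prints True/False; otherwise it should raise the same
    # "no file system found for url" error as the write.
    print(hl.hadoop_exists("s3a://emr-demo-test-868/dragen.vcf.gz"))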

Any immediate suggestions would be greatly appreciated.