S3 connection error

Santiago · September 28, 2020, 2:34pm

Hi,

I am using Hail to read VCFs from an S3 bucket. I initialize Hail and then import the VCFs over s3a. Everything works for maybe two or three hours of analysis, but then the connection is lost and an error is raised. If I stop and re-initialize Hail each time I import a VCF, it works, but because the analysis is distributed and needs to create a lot of files, I eventually get a "too many open files" error, and performance gets worse and worse. So I would like to know whether you have worked with S3 directly and how you got it to work.

Santiago

johnc1231 · September 28, 2020, 2:37pm

What does your pipeline look like? What function are you using to write your output?

Santiago · September 28, 2020, 2:39pm

The pipeline is very simple:
import_vcf
filter_rows
split_multi_hts
GT.export
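
For readers following along, the four steps above might look roughly like this in Hail's Python API. This is only a sketch: the paths and the filter condition are placeholders (the post does not say what is being filtered), and the Hail import sits inside the function so the snippet can be loaded on a machine without Hail.

```python
def run_pipeline(vcf_path, out_path):
    """Sketch of import_vcf -> filter_rows -> split_multi_hts -> GT.export.

    vcf_path and out_path are placeholders, e.g. 's3a://my-bucket/sample.vcf.bgz'
    (hypothetical) and a local or s3a output path.
    """
    import hail as hl  # lazy import; needs a working Hail/Spark installation

    mt = hl.import_vcf(vcf_path)
    # Hypothetical filter, used only for illustration: keep rows that
    # have at least one alternate allele.
    mt = mt.filter_rows(hl.len(mt.alleles) >= 2)
    # Split multiallelic sites so each row has a single alternate allele.
    mt = hl.split_multi_hts(mt)
    # Export the genotype entry field as a tab-separated text matrix.
    mt.GT.export(out_path)
    return out_path
```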

danking · September 28, 2020, 2:45pm

There’s more information here: AttributeError: 'DataFrame' object has no attribute 'to_spark', but we’ve had trouble with S3 timing out on long pipelines. You might try using stage_locally. If you share the hail log file and the stack traces, we might be able to help further.

danking · September 28, 2020, 2:46pm

Also, you probably do not want to use GT.export. That produces a text file. Try writing to a MatrixTable with mt.write(...). What do you plan to use the tab-separated file of genotypes for?
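
A minimal sketch of the two suggestions above combined, assuming `mt` is the MatrixTable produced by the pipeline and `out_path` is a placeholder destination. `stage_locally` is an existing parameter of `MatrixTable.write`: when true, output is written to temporary local storage first and then copied to the destination. Whether it avoids the timeout in this case is untested here.

```python
def write_matrix_table(mt, out_path):
    """Write mt in Hail's native MatrixTable format instead of a text export.

    out_path is a placeholder, e.g. 's3a://my-bucket/out.mt' (hypothetical).
    """
    # stage_locally=True writes to local temporary storage first, then
    # copies the result to out_path.
    mt.write(out_path, stage_locally=True)
    return out_path
```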

Santiago · September 28, 2020, 3:36pm

Yes, I set spark.hadoop.fs.s3a.connection.maximum to a large number, but as I said, I still get an error after reading maybe ten or twelve files (they are very big files). My pipeline now uses the AWS CLI to cp files from S3 to my EC2 instance and then imports them into Hail. This is stable and runs without errors, but of course it is slower than reading directly from S3.

I use GT.export because I need a text file for my postprocessing.
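
The two approaches described in this post could be sketched as follows. The connection-pool value of 1000 and the helper names are assumptions made for illustration (the post only says "a big number"), and the staging helper assumes the local directory is an absolute path readable by every Spark worker, as on a single EC2 instance.

```python
import subprocess

# Assumed value for illustration; the post does not give the actual number.
S3A_CONF = {"spark.hadoop.fs.s3a.connection.maximum": "1000"}


def cp_command(s3_uri, local_dir):
    """Build an `aws s3 cp` command that copies one object into local_dir."""
    # The AWS CLI expects the s3:// scheme, not Hadoop's s3a://.
    if s3_uri.startswith("s3a://"):
        s3_uri = "s3://" + s3_uri[len("s3a://"):]
    local_path = local_dir.rstrip("/") + "/" + s3_uri.rsplit("/", 1)[-1]
    return ["aws", "s3", "cp", s3_uri, local_path], local_path


def stage_and_import(s3_uri, local_dir):
    """Copy a VCF from S3 to local disk, then import the local copy."""
    import hail as hl  # lazy import; needs a working Hail/Spark installation

    # idempotent=True lets this be called repeatedly without re-initializing.
    hl.init(spark_conf=S3A_CONF, idempotent=True)
    cmd, local_path = cp_command(s3_uri, local_dir)
    subprocess.run(cmd, check=True)  # raises if the copy fails
    return hl.import_vcf("file://" + local_path)
```

Initializing once and reusing the session sidesteps the stop-and-init cycle that led to the "too many open files" error earlier in the thread, though this sketch has not been run against the original data.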
