I am using Hail to read VCFs from an S3 bucket. I initialize Hail and then import the VCFs over S3A. Everything works for two or three hours of analysis, but then the connection is lost and an error is raised. If I stop and re-initialize Hail each time I import a VCF, the imports work, but because the analysis is distributed and creates a lot of files, I eventually get a "too many open files" error, and performance gets progressively worse. Have you worked with S3 directly, and if so, how did you get it to work?
What does your pipeline look like? What function are you using to write your output?
The pipeline is very simple:
We’ve had trouble with S3 timing out on long pipelines; there’s more information here: AttributeError: 'DataFrame' object has no attribute 'to_spark'. You might try using
stage_locally. If you share the Hail log file and the stack traces, we might be able to help further.
Also, you probably do not want to use GT.export. That produces a text file. Try writing to a MatrixTable with
mt.write(...). What do you plan to use the tab separated file of genotypes for?
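As a rough illustration of the two suggestions above, here is a minimal sketch that writes a MatrixTable in Hail's native format with stage_locally turned on. The helper name write_native and the bucket paths are made up for the example, and whether staging locally actually avoids the S3 timeouts in this setup is untested.

```python
def write_native(mt, out_path):
    """Write a MatrixTable in Hail's native format, staging output locally first.

    `write_native` is a hypothetical helper name used only for this sketch.
    """
    # stage_locally=True asks Hail to write the major output to temporary
    # local storage before copying it to the final destination.
    mt.write(out_path, overwrite=True, stage_locally=True)
    return out_path


if __name__ == "__main__":
    # Imported here so the helper above has no hard dependency on Hail.
    import hail as hl

    hl.init()
    # Both paths below are placeholders, not taken from the original pipeline.
    mt = hl.import_vcf("s3a://my-bucket/sample.vcf.bgz")
    write_native(mt, "s3a://my-bucket/sample.mt")
```

The native format can be read back later with hl.read_matrix_table, and a text export can then be run from that table instead of from the VCF import.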
Yes, I set spark.hadoop.fs.s3a.connection.maximum to a large number, but as I said, I still get an error after reading maybe ten or twelve files (they are very big files). For now, my pipeline uses the AWS CLI to cp the files from S3 to my EC2 instance and then imports them into Hail. This is stable and runs without errors, but of course it is slower than reading directly from S3.
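For anyone wanting to reproduce the workaround, a sketch of the copy-then-import step might look like the following. The function names, bucket, and directory are hypothetical. It assumes the AWS CLI is installed and configured, that the source is given as an s3:// URI (the CLI does not understand s3a://), that local_dir is an absolute path on a Linux host, and that the local file is visible to every Spark worker.

```python
import posixpath
import subprocess


def build_cp_command(s3_uri, local_dir):
    """Return the `aws s3 cp` argument list and the resulting local path.

    Hypothetical helper; uses POSIX path rules because a Linux EC2 host is assumed.
    """
    local_path = posixpath.join(local_dir, posixpath.basename(s3_uri))
    return ["aws", "s3", "cp", s3_uri, local_path], local_path


def fetch_and_import(s3_uri, local_dir):
    """Copy one VCF from S3 to local disk, then import it into Hail."""
    # Imported lazily so build_cp_command can be used without Hail installed.
    import hail as hl

    cmd, local_path = build_cp_command(s3_uri, local_dir)
    # check=True raises CalledProcessError if the copy fails.
    subprocess.run(cmd, check=True)
    # The file:// prefix assumes local_dir is absolute and that Hail should
    # read from the local filesystem rather than the cluster default.
    return hl.import_vcf("file://" + local_path)


if __name__ == "__main__":
    import hail as hl

    hl.init()
    # Placeholder bucket and directory, not taken from the original pipeline.
    mt = fetch_and_import("s3://my-bucket/vcfs/sample.vcf.bgz", "/data/vcfs")
```

Copying each file once and reading it locally trades extra disk space and transfer time for not holding S3A connections open across a multi-hour job.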
I use GT.export because I need a text file for my postprocessing.