Dear all,
We are migrating from HDFS to S3, and our existing pipeline now raises the following error:
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 4.0 failed 4 times, most recent failure: Lost task 2.3 in stage 4.0 (TID 29) (10.0.99.51 executor 0): java.io.InterruptedIOException: getFileStatus on s3a://.../XXXX.vcf.gz.tbi: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
[...]
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
[...]
Caused by: com.amazonaws.thirdparty.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
[...]
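If it helps, my reading of the trace is that the S3A client's HTTP connection pool is exhausted, so each new getFileStatus call waits for a free connection and eventually times out. From the Hadoop S3A documentation I understand the pool size is controlled by fs.s3a.connection.maximum; a minimal sketch of raising it (untested on our side, the values are guesses, and `spark` is the session built below):

# Hypothetical mitigation: enlarge the S3A connection pool and the
# thread pool that draws from it.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.setInt('fs.s3a.connection.maximum', 100)
hadoop_conf.setInt('fs.s3a.threads.max', 64)

I suspect this would only postpone the failure if connections are being leaked rather than returned to the pool.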
After some tests, I found that I can load 3-4 gVCFs with our pipeline, but loading more of them, either at the same time or iteratively, raises the same error.
The setup is:
from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf().setAppName('genetic-pipeline')
spark = SparkSession.builder.config(conf = spark_conf).getOrCreate()

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.setInt('dfs.block.size', 536870912)      # 512 MB
hadoop_conf.setInt('parquet.block.size', 536870912)  # 512 MB
hadoop_conf.set('fs.s3a.access.key', 'ABC')
hadoop_conf.set('fs.s3a.secret.key', 'ABC')
hadoop_conf.set('fs.s3a.endpoint', 'X.X.X.X:YYYY')
# Note: spark.python.worker.reuse is a Spark property, not a Hadoop one,
# so setting it on the Hadoop configuration likely has no effect.
hadoop_conf.set('spark.python.worker.reuse', 'true')
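In case it matters, my understanding is that Spark also forwards any SparkConf entry prefixed with spark.hadoop. into the Hadoop configuration, so the S3A options could be set before the first filesystem instance is created. A sketch of that variant (same placeholder credentials as above):

spark_conf = SparkConf().setAppName('genetic-pipeline') \
    .set('spark.hadoop.fs.s3a.access.key', 'ABC') \
    .set('spark.hadoop.fs.s3a.secret.key', 'ABC') \
    .set('spark.hadoop.fs.s3a.endpoint', 'X.X.X.X:YYYY') \
    .set('spark.python.worker.reuse', 'true')  # a Spark property, set where it belongs
spark = SparkSession.builder.config(conf = spark_conf).getOrCreate()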
The code I am using follows (the interval dict and the experiment lists are global):
import hail as hl
# Combiner helpers (exact import path may differ across Hail versions)
from hail.experimental.vcf_combiner.vcf_combiner import combine_gvcfs, transform_gvcf

def transformFile(mt):
    # Add the INFO fields the combiner expects but our gVCFs lack.
    return transform_gvcf(mt.annotate_rows(
        info = mt.info.annotate(
            MQ_DP = hl.missing(hl.tint32),
            VarDP = hl.missing(hl.tint32),
            QUALapprox = hl.missing(hl.tint32)
        )
    ))

def importFiles(files):
    return hl.import_vcfs(
        files,
        partitions = interval['interval'],
        reference_genome = interval['reference_genome'],
        array_elements_required = interval['array_elements_required']
    )
# First pack of 3 gVCFs
vcfs = [ transformFile(mt) for mt in importFiles(experiments_1) ]
comb1 = combine_gvcfs(vcfs)
comb1.write("s3a://.../test1", overwrite = True)

# Second pack of 3 gVCFs
vcfs = [ transformFile(mt) for mt in importFiles(experiments_2) ]
comb2 = combine_gvcfs(vcfs)
comb2.write("s3a://.../test2", overwrite = True)  # <-- it stops here, raising the error from import_vcfs

# Final combine of both packs
comb = combine_gvcfs([comb1, comb2])
comb.write("s3a://.../test", overwrite = True)
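One workaround I am considering is re-reading each written pack from S3 before the final combine, so that the last step does not keep the original gVCF import pipelines behind comb1 and comb2 alive. A sketch (not yet verified to avoid the timeout):

# Hypothetical variant: re-read the packs already written to S3 instead of
# reusing the in-memory plans that still reference the source gVCFs.
comb1 = hl.read_matrix_table("s3a://.../test1")
comb2 = hl.read_matrix_table("s3a://.../test2")
comb = combine_gvcfs([comb1, comb2])
comb.write("s3a://.../test", overwrite = True)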
Any help elucidating why the connections are not reused and/or not terminated would be welcome.