shuffle.FetchFailedException error

Hello,

I'm extracting the column table, which includes sample information, from a MatrixTable file as shown below.

mt_qced = hl.read_matrix_table('gs://bucket/QCed.MT')
mt_qced = hl.sample_qc(mt_qced)

sample_info_file = 'gs://bucket/sampleinfo.txt'
mt_qced.cols().flatten().export(sample_info_file)

but it failed with the same error message across several runs:

Failed to connect to cluster-sw-m90f.c.project.internal:7337 at org.apache.spark.storage.ShuffleBlockFetcherIterator …
Caused by: java.io.IOException: Failed to connect to cluster-sw-m90f.c.project.internal:7337

I ran it on Dataproc in GCP with autoscaling enabled. Please let me know of any possible solutions. Thank you.

-Jina

Hi Jina, sorry you’re hitting this problem!

This is an error we’re very familiar with; it results from the interaction between Spark shuffles and GCP preemptible nodes. Spark implements shuffles using all-to-all communication between the nodes in your cluster, with important temporary data stored on each node. If a single node fails (or is preempted), the shuffle must be recomputed in its entirety, or it throws the error you’re seeing (a FetchFailedException, because one node can’t fetch a block of data from another machine that has been preempted).

sample_qc and, more generally, MatrixTable.annotate_cols are implemented using a shuffle to perform a tree aggregation, which alleviates memory/CPU pressure on the driver node during large aggregations. We have some infrastructure coming soon that will let us use another execution strategy instead of a shuffle, but until then, it is safest to run big sample_qc jobs on non-preemptible workers only.
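For reference, a cluster with only non-preemptible (primary) workers can be started roughly like this. The cluster name and worker count below are placeholders, and flag spellings can vary slightly between hailctl versions (`--num-secondary-workers` vs. the older `--num-preemptible-workers`), so check `hailctl dataproc start --help` on your installed version:

```shell
# Start a Dataproc cluster with non-preemptible (primary) workers only.
# "my-cluster" and the worker count are placeholders; adjust for your project.
hailctl dataproc start my-cluster \
    --num-workers 4 \
    --num-secondary-workers 0
```

With zero secondary (preemptible) workers and autoscaling off, shuffle data can't be lost to a preemption mid-job, which is what triggers the FetchFailedException above.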

Hi Tim,

Thank you so much for your explanation. I will try to run it on preemptible workers only.

-Jina

Hi Tim,

Unfortunately, I got the same error again. What else could we suspect?

-Jina

Can you share the full stack trace? Also, to clarify: you ran this on primary / non-preemptible workers only (-w / --num-workers non-zero, -p / --num-secondary-workers equal to 0, and autoscaling off), right?

Thank you so much. Your suggestion made it work well.

-Jina