shuffle.FetchFailedException error


I am extracting per-sample QC information from the columns of a MatrixTable file, as below.

import hail as hl

# Read the saved MatrixTable and compute per-sample QC metrics
mt_qced = hl.read_matrix_table('gs://bucket/QCed.MT')
mt_qced = hl.sample_qc(mt_qced)


but it failed with the same error message across several runs:

Failed to connect to cluster-sw-m90f.c.project.internal:7337 at …
Caused by: Failed to connect to cluster-sw-m90f.c.project.internal:7337

I ran it on Dataproc in GCP with autoscaling enabled. Please let me know of any possible solutions. Thank you.


Hi Jina, sorry you’re hitting this problem!

This is an error we’re very familiar with, and it is a result of the interaction between Spark shuffles and GCP preemptible nodes. Spark implements shuffles using all-to-all communication between the nodes in your cluster, with important temporary data stored on each node. If a single node fails (or is preempted), the shuffle must be recomputed in its entirety, or it sometimes throws the error you’re seeing: a FetchFailedException, because one node can’t fetch a block of data from another machine that has been preempted.

sample_qc, and MatrixTable.annotate_cols more generally, is implemented using a shuffle to perform a tree aggregation, which alleviates memory/CPU pressure on the driver node during large aggregations. We have some infrastructure coming soon that will let us use another execution strategy instead of a shuffle, but until then, it is safest to run big sample_qc jobs on non-preemptible workers only.
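For reference, one way to get a cluster with no preemptible workers is to set the secondary-worker count to zero when starting it with hailctl. This is a sketch, not your exact setup: the cluster name and worker count below are placeholders, and depending on your hailctl version the flag may be spelled --num-preemptible-workers instead of --num-secondary-workers.

```shell
# Hypothetical cluster name and size; adjust for your project.
# With zero secondary (preemptible) workers, no shuffle data
# can be lost to preemption mid-job.
hailctl dataproc start my-cluster \
    --num-workers 4 \
    --num-secondary-workers 0
```

With autoscaling off and only primary workers, the shuffle blocks stay on machines that won't disappear under the job.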

Hi Tim,

Thank you so much for your explanation. I will try to run it on preemptible workers only.


Hi Tim,

Unfortunately, I got the same error again. What else could we suspect?


Can you share the full stack trace? Also, to clarify, you ran this on primary / non-preemptible (-w / --num-workers non-zero, with -p / --num-secondary-workers equal to 0 and autoscaling off), right?
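In case it helps with checking this, the worker configuration of a running cluster can be inspected with gcloud. The cluster name and region below are placeholders; when no preemptible workers are attached, secondaryWorkerConfig should report zero instances or be absent.

```shell
# Hypothetical cluster name/region; shows primary worker count
# and any secondary (preemptible) worker config.
gcloud dataproc clusters describe my-cluster \
    --region us-central1 \
    --format="yaml(config.workerConfig.numInstances, config.secondaryWorkerConfig)"
```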

Thank you so much. Your suggestion made it work well.