shuffle.FetchFailedException error

Hello,

I'm extracting the column table, which includes sample information, from a MatrixTable file as shown below.

mt_qced = hl.read_matrix_table('gs://bucket/QCed.MT')
mt_qced = hl.sample_qc(mt_qced)

sample_info_file = 'gs://bucket/sampleinfo.txt'
mt_qced.cols().flatten().export(sample_info_file)

but it failed with the same error message across several runs:

Failed to connect to cluster-sw-m90f.c.project.internal:7337 at org.apache.spark.storage.ShuffleBlockFetcherIterator …
Caused by: java.io.IOException: Failed to connect to cluster-sw-m90f.c.project.internal:7337

I ran it on Dataproc in GCP with autoscaling enabled. Please let me know of any possible solutions. Thank you.

-Jina

Hi Jina, sorry you’re hitting this problem!

This is an error we’re very familiar with; it results from the interaction between Spark shuffles and GCP preemptible nodes. Spark implements shuffles using all-to-all communication between the nodes in your cluster, with important temporary data stored on each node. If a single node fails (or is preempted), the shuffle must be recomputed in its entirety, or it throws the error you’re seeing (a FetchFailedException, because one node can’t fetch a block of data from another machine that has been preempted).

sample_qc and, more generally, MatrixTable.annotate_cols are implemented using a shuffle to perform a tree aggregation, which alleviates memory/CPU pressure on the driver node during large aggregations. We have some infrastructure coming soon that will let us use another execution strategy instead of a shuffle, but until then, it is safest to run big sample_qc jobs on non-preemptible workers only.
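For reference, a cluster with only non-preemptible (primary) workers can be started roughly like this. The cluster name and worker count below are placeholders, and flag spellings can vary slightly between hailctl versions (`--num-secondary-workers` vs. the older `--num-preemptible-workers`), so check `hailctl dataproc start --help` on your installed version:

```shell
# Start a Dataproc cluster with non-preemptible (primary) workers only.
# "my-cluster" and the worker count are placeholders; adjust for your project.
hailctl dataproc start my-cluster \
    --num-workers 4 \
    --num-secondary-workers 0
```

With zero secondary (preemptible) workers and autoscaling off, shuffle data can't be lost to a preemption mid-job, which is what triggers the FetchFailedException above.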

Hi Tim,

Thank you so much for your explanation. I will try to run it on preemptible workers only.

-Jina

Hi Tim,

Unfortunately, I got the same error again. What else could we suspect?

-Jina

Can you share the full stack trace? Also, to clarify: you ran this on primary / non-preemptible workers only (-w / --num-workers non-zero, -p / --num-secondary-workers equal to 0, and autoscaling off), right?

Thank you so much. Your suggestion made it work well.

-Jina