Shuffling and writing a MatrixTable appears to run the shuffle op twice

I’ve been learning Hail and Spark, and I can’t tell if the following is expected behavior. I have a large MT on GCS (~2.5 TB) with very poor data distribution at the moment (99%+ of the data sits in 11 of 1709 partitions). I’m looking to shuffle it and write it back to GCS (to a new destination). I tried the following commands:

import hail as hl

ds = hl.read_matrix_table(f'{bucket}/my_data.mt')
ds = ds.repartition(500, shuffle=True)  # full shuffle into 500 partitions
ds.write(f'{bucket}/shuffled.mt')

Keeping an eye on the Spark UI, it looks like the load and repartition are being run twice before the output is written. See below:


(Note that this run is just shy of a TB, as I’m having to filter the data down into chunks; it was failing at the full size.)

Am I misinterpreting this? Is there something I’m doing that would cause it, or is there a way to avoid it?

This is expected. The implementation of repartition does a random hash-based shuffle first to balance the partitions, then re-sorts. It’s also possible to repartition from native MatrixTable files by writing them out and reading them back with the _n_partitions=... argument, but in your case that might be super expensive because of the skewed partition sizes.
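If it helps, here’s a rough sketch of that read-with-_n_partitions approach. Since your data is already a native MT, you could skip the extra write and just re-read it with the argument; the output path and the count of 500 are placeholders, and _n_partitions is an underscored (internal) argument, so its behavior may change between Hail versions:

import hail as hl

# re-read the existing native MT, asking Hail to split it into ~500 partitions
ds = hl.read_matrix_table(f'{bucket}/my_data.mt', _n_partitions=500)
ds.write(f'{bucket}/repartitioned.mt')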

How did all the data end up in 11 partitions?

Ok, thank you. I couldn’t tell from the documentation (or from going one or two function calls deep into the code) what the _n_partitions argument would actually do, so that’s great to know.

It’s an unusual situation, as I had some limitations on the environment. I imported a (large) number of single-sample VCFs one by one and unioned the columns in blocks of 20, then 20x20 (and so on), roughly as sketched below. I’m guessing the smaller MTs were partitioned unusually because I was using a bunch of small Hail deployments (3 executors each). In retrospect, I did see the “min” and “max” sizes of the partitions in the early phases (specifically 1 row on the small end), but I didn’t recognize the risk of skew and wrote it off.
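For context, this is roughly the pattern I used; the paths, block size, reference genome, and the outer row join here are placeholders, not the exact code I ran:

import functools
import hail as hl

# placeholder paths for the single-sample VCFs
all_paths = [f'{bucket}/vcfs/sample_{i}.vcf.bgz' for i in range(400)]

def union_block(vcf_paths):
    # import each single-sample VCF and merge columns pairwise within the block
    mts = [hl.import_vcf(p, reference_genome='GRCh38') for p in vcf_paths]
    return functools.reduce(
        lambda left, right: left.union_cols(right, row_join_type='outer'), mts)

# first level: blocks of 20 samples
blocks = [union_block(all_paths[i:i + 20]) for i in range(0, len(all_paths), 20)]

# next level: union the block-level MTs (20 x 20, and so on)
combined = functools.reduce(
    lambda left, right: left.union_cols(right, row_join_type='outer'), blocks)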