Repartitioning after significant filtering

Hi!

I’m using Hail in a cloud environment to analyze srWGS data from All of Us and am running into difficulty with the large number of partitions that remains after extensive filtering. The initial matrix table has about 145,000 partitions (38,150,243 rows and ~431,000 columns), and after filtering I should be left with only 47 rows. Based on the documentation, I have tried repartitioning, since the filtering is so extensive, but none of mt.repartition(n, shuffle=True), mt.repartition(n, shuffle=False), or mt.naive_coalesce(n) seems to actually change the number of partitions.

Here is the log from the attempt to repartition with shuffling, as my understanding is that shuffling is best after this much filtering: https://drive.google.com/file/d/1Vp5nm0Hrh8CDtIcjISqB1DFPFQwKobms/view?usp=sharing

Relevant lines of code for filtering:

exome_mt = hl.read_matrix_table(exome_mt_path)

exome_mt = hl.split_multi_hts(exome_mt)

vat_TPTE = vat_TPTE.key_by(locus=hl.locus(vat_TPTE.contig, hl.int32(vat_TPTE.position), reference_genome="GRCh38"),
                           alleles=hl.array([vat_TPTE.ref_allele, vat_TPTE.alt_allele]))

exome_mt = exome_mt.semi_join_rows(vat_TPTE)

exome_mt = exome_mt.filter_rows(~hl.any(exome_mt.filters.contains("NO_HQ_GENOTYPES"),
                                        exome_mt.filters.contains("EXCESS_ALLELES"),
                                        exome_mt.filters.contains("LowQual")))

related_remove = hl.import_table(related_samples_path,
                                 types={"sample_id": hl.tstr},
                                 key="sample_id")

exome_mt = exome_mt.anti_join_cols(related_remove)

exome_mt_re = exome_mt.repartition(8000, shuffle=True)
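
The other variants I mentioned above looked roughly like this (I used the same target of 8,000 partitions, though I'm not sure the exact number matters here):

exome_mt_re = exome_mt.repartition(8000, shuffle=False)

exome_mt_re = exome_mt.naive_coalesce(8000)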

When I then check the number of partitions, it still shows 145k+ partitions.
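
(For reference, I'm checking the count along these lines, assuming n_partitions() is the right way to inspect this:)

print(exome_mt_re.n_partitions())  # still reports ~145,000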

Apologies if this is a naive question, as this is my first time using Hail; any suggestions would be appreciated!

Best,

Grace