Hail Repartition returns uneven partitions with one very large partition

darn_matren · March 17, 2023, 3:17pm

Hey Hail team,

I’m working on a script that, as its first step, reads in a Hail Table and outputs a sharded VCF (which we will then annotate in parallel to add a gnomAD annotation). However, it is outputting very uneven partitions, with the last partition being many times the size of the others. As well as being confusing in principle, this hurts performance downstream. I’ll copy some screenshots of my code and the outputs below. I’m running this on Query-on-Batch and have included my hailctl config as well. Let me know if there are any solutions and/or explanations as to what’s up, thanks!

tpoterba · March 17, 2023, 4:16pm

repartition() (and all partitioning operations) aim for a roughly equal number of rows per partition, which isn’t a great approach when the size of rows varies wildly. That said, there could definitely be a bug in partition selection here. If you unzip and wc -l this last file and one of the others, does it have around the same number of lines or 5x more?
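For anyone who wants to run this check without unzipping by hand, here is a minimal sketch in plain Python. It assumes the shards are gzip- or bgzip-compressed VCFs; the function name `count_vcf_lines` is made up for illustration and is not part of Hail. It reports header lines and record lines separately, so the record counts can be compared across shards.

```python
import gzip
import sys


def count_vcf_lines(path):
    """Return (header_lines, record_lines) for a gzip- or bgzip-compressed VCF.

    Hypothetical helper, not a Hail API. Lines starting with '#' are
    counted as header lines; everything else is counted as a record.
    """
    header = records = 0
    with gzip.open(path, "rt") as f:
        for line in f:
            if line.startswith("#"):
                header += 1
            else:
                records += 1
    return header, records


if __name__ == "__main__":
    # Pass the shard paths on the command line.
    for path in sys.argv[1:]:
        header, records = count_vcf_lines(path)
        print(f"{path}\t{header} header lines\t{records} records")
```

If the record counts are similar across shards, the size difference comes from row size; if the last shard has many more records, the rows themselves are being split unevenly.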

darn_matren · March 17, 2023, 6:59pm

Yep, I opened them with hl.import_vcf() in a notebook, and the last file had many times more lines. For 10 partitions and ~1000 variants (lines), it was split into 9 tables with 21 lines each and 1 table with ~811 lines or so. I’m not sure what’s up?

darn_matren · March 20, 2023, 4:30pm

New development: oddly enough, when my script reads in the gnomAD v3 release data (from the links at gnomAD), the partitions are even. The odd behavior only shows up for Hail Tables that I had generated and worked on myself.

These tables were generated for work on another project, where I read in the v3 sites data, then filtered and downsampled it to a number of variants with a good mix of different variant types (SNVs, indels, variants at multiallelic sites, variants on sex chromosomes, etc.), with a set seed for randomness. Can you think of anything that could be weird/wrong with the table itself that would result in uneven partitions?

tpoterba · March 21, 2023, 10:01am

Narrowed down to a bug in the lowered version of TableCalculateNewPartitions (it’s just totally wrong, actually: it samples keys at the front of the partition with higher probability).
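To illustrate why a front-biased key sample can produce one oversized final partition, here is a toy model in plain Python. It is not Hail's implementation of TableCalculateNewPartitions. It assumes a single sorted input partition of 1000 keys, picks boundaries as evenly spaced quantiles of a 200-key sample, and uses the most extreme form of the bias (the sample comes only from the front), whereas the real bug is probabilistic. The function name `split_by_sampled_keys` is made up for this sketch.

```python
import bisect


def split_by_sampled_keys(keys, n_out, sample):
    """Choose n_out - 1 boundaries from a key sample; return rows per output partition.

    Toy model only: boundaries are evenly spaced quantiles of the sorted sample.
    """
    sample = sorted(sample)
    bounds = [sample[len(sample) * j // n_out] for j in range(1, n_out)]
    sizes = [0] * n_out
    for k in keys:
        # A key lands in the partition whose index equals the number of boundaries <= key.
        sizes[bisect.bisect_right(bounds, k)] += 1
    return sizes


keys = list(range(1000))  # assumed: one sorted input partition with 1000 rows

# Unbiased sample: every 5th key, spread over the whole key range.
uniform_sizes = split_by_sampled_keys(keys, 10, keys[::5])

# Front-biased sample (extreme case): only the first 200 keys are sampled.
front_sizes = split_by_sampled_keys(keys, 10, keys[:200])

print(uniform_sizes)  # [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
print(front_sizes)    # [20, 20, 20, 20, 20, 20, 20, 20, 20, 820]
```

Because every boundary is drawn from the front of the key range, the first nine partitions stay small and the last one absorbs everything after the final boundary. This is the same shape as the 9 small files and 1 large file reported above.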