Hail Repartition returns uneven partitions with one very large partition

Hey Hail team,

I’m working on a script that, at the start, reads in a Hail Table and outputs a sharded VCF (which we will then annotate in parallel to add a gnomAD annotation). However, it is outputting very uneven partitions, with the last partition being many times the size of the others. As well as being theoretically confusing, this hurts performance downstream. I’ll copy some screenshots of my code and the outputs below. I’m running this on Query-on-Batch and have included my hailctl config as well. Let me know if there are any solutions and/or explanations as to what’s up, thanks!

repartition() (and all partitioning operations) aim to put roughly the same number of rows in each partition, which isn’t a great approach when the size of rows varies wildly. That said, there could definitely be a bug in partition selection here. If you unzip and wc -l this last file and one of the others, does it have around the same number of lines or 5x more?
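If it’s easier than unzipping by hand, something like the following should give the same answer as wc -l minus the header lines. This is only a rough sketch: the helper names are made up for illustration, and it assumes the shards are gzip- or bgzip-compressed VCF text files sitting on local disk.

```python
import gzip


def count_variant_lines(path):
    """Count non-header, non-empty lines in a gzip/bgzip-compressed VCF shard."""
    # bgzip files are concatenated gzip members, which the gzip module can read.
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


def summarize_shards(paths):
    """Return a {path: variant line count} mapping for the given shard paths."""
    return {path: count_variant_lines(path) for path in paths}


# Hypothetical usage; the shard file names below are placeholders:
# print(summarize_shards(["part-00000.bgz", "part-00009.bgz"]))
```

If the last shard has many times the line count of the others, the skew is in the number of rows rather than in the size of individual rows.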

Yep, I opened them with hl.import_vcf() in a notebook and the last file had many times the number of lines. For 10 partitions and ~1000 variants (lines), it was split into 9 tables with 21 lines each and 1 table with ~811 lines. I’m not sure what’s up?

New development: oddly enough, when my script reads in the gnomAD v3 release data (from the links at gnomAD), the partitions are even. The odd behavior only shows up for Hail Tables that I generated and worked on myself.

These tables were generated for work on another project, where I read in the v3 sites data, then filtered and downsampled it to a number of variants with a good mix of different variant types (SNVs, indels, variants at multiallelic sites, variants on sex chromosomes, etc.), with a set seed for randomness. Can you think of anything that could be weird/wrong with the table itself that would result in uneven partitions?

Narrowed this down to a bug in the lowered version of TableCalculateNewPartitions (it’s just totally wrong, actually: it samples keys at the front of the partition with higher probability).
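To show why front-biased key sampling would produce this kind of skew, here is a toy simulation. It is not Hail’s implementation: the function name is made up, the keys are plain integers, and the biased sample is simply taken from the front of the whole dataset rather than the front of each input partition. It picks partition boundaries at even quantiles of a key sample and then counts how many keys land in each resulting partition.

```python
import bisect


def sizes_from_sample(keys, sample, n_partitions):
    """Partition sorted `keys` using boundaries taken at even quantiles of `sample`.

    Returns the number of keys that fall into each of the n_partitions ranges.
    """
    ordered = sorted(sample)
    # n_partitions - 1 interior boundaries, evenly spaced through the sample.
    bounds = [ordered[(i * len(ordered)) // n_partitions] for i in range(1, n_partitions)]
    cuts = [0] + [bisect.bisect_left(keys, b) for b in bounds] + [len(keys)]
    return [cuts[i + 1] - cuts[i] for i in range(n_partitions)]


keys = list(range(1000))   # stand-in for ~1000 sorted variant keys

even_sample = keys[::5]    # representative sample: every 5th key
front_sample = keys[:200]  # biased sample: only the first 200 keys

print(sizes_from_sample(keys, even_sample, 10))   # ten partitions of 100 keys
print(sizes_from_sample(keys, front_sample, 10))  # nine partitions of 20, one of 820
```

With the biased sample, every boundary falls inside the first 200 keys, so the last partition absorbs everything after the final boundary. That has the same shape as the 9 small shards plus 1 large shard reported above.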
