Repartition vs repartition on read

ch-kr · March 9, 2022, 4:41pm

Hi hail team!

My understanding is that repartitioning a Table on read (hl.read_table(path, _n_partitions=n_partitions)) is better than running ht.repartition(n_partitions). However, I’ve noticed that repartitioning a Table on read isn’t increasing the number of partitions in an input Table to the desired number, but running ht.repartition() correctly increases the number of partitions.

For example, this is the schema of the input Table:

----------------------------------------
File Type: Table
    Partitions: 4906
    Rows: 100
    Empty partitions: 4806
    Min(rows/partition): 0
    Max(rows/partition): 1
    Median(rows/partition): 0.0
    Mean(rows/partition): 0
    StdDev(rows/partition): 0
----------------------------------------
Global fields:
    'min_window_size': int32
----------------------------------------
Row fields:
    'transcript': str
    'total_oe': float64
    'max_idx': int32
    'transcript_start': int32
    'transcript_end': int32
    'cum_obs': array<int64>
    'cum_exp': array<float64>
    'positions': array<int32>
    'list_len': int32
----------------------------------------
Key: ['transcript']
----------------------------------------

And this is the code I’m running:

import hail as hl

ht = hl.read_table('gs://gnomad-tmp/kc/test_100_over5k.ht')
split_window_size = 500
ht = ht.filter(ht.transcript == 'ENST00000217939')
ht = ht.annotate(
    start_idx=hl.flatmap(
        lambda i: hl.map(
            lambda j: hl.struct(i_start=i, j_start=j),
            hl.range(0, ht.list_len, split_window_size),
        ),
        hl.range(0, ht.list_len, split_window_size),
    )
)
ht = ht.explode("start_idx")
ht = ht.annotate(i=ht.start_idx.i_start, j=ht.start_idx.j_start)
ht = ht._key_by_assert_sorted("transcript", "i", "j")
ht = ht.filter(ht.j >= ht.i)
ht = ht.annotate(
    # Use the named window size rather than repeating the literal 500.
    i_max_idx=hl.min(ht.i + split_window_size, ht.list_len - 1),
    j_max_idx=hl.min(ht.j + split_window_size, ht.list_len - 1),
)
ht = ht.annotate(
    start_idx=ht.start_idx.annotate(
        j_start=hl.if_else(
            ht.start_idx.i_start == ht.start_idx.j_start,
            ht.start_idx.j_start + 1,
            ht.start_idx.j_start,
        ),
    ),
)
n_rows = ht.count()
print(n_rows)

The print statement here prints 120.
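
As a sanity check on that number: assuming the filter leaves a single row for the transcript, the flatmap builds every (i, j) pair of window starts and the `j >= i` filter keeps the upper triangle including the diagonal, which is n(n+1)/2 rows for n window starts. 120 = 15 × 16 / 2, so this would correspond to 15 window starts. The thread does not state `list_len`; the value 7500 below is an assumption chosen to give 15 starts, and the function names are hypothetical.

```python
import math


def count_window_pairs(list_len: int, window: int) -> int:
    """Count (i, j) pairs of window starts with j >= i, by enumeration."""
    starts = range(0, list_len, window)
    return sum(1 for i in starts for j in starts if j >= i)


def count_window_pairs_closed_form(list_len: int, window: int) -> int:
    """Same count via n(n+1)/2, where n is the number of window starts."""
    n = math.ceil(list_len / window)
    return n * (n + 1) // 2


# list_len=7500 is an assumed value, not taken from the thread.
print(count_window_pairs(7500, 500))
print(count_window_pairs_closed_form(7500, 500))
```

Any `list_len` from 7001 to 7500 gives the same count with a window of 500.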

Running

ht.write('gs://gnomad-tmp/kc/mxra5_rep_on_read_test.ht', overwrite=True)
ht = hl.read_table('gs://gnomad-tmp/kc/mxra5_rep_on_read_test.ht', _n_partitions=n_rows)
ht.n_partitions()

prints 20.

However, running

ht = ht.repartition(n_rows)
ht = ht.checkpoint('gs://gnomad-tmp/kc/mxra5_repart_test.ht', overwrite=True)
ht.n_partitions()

prints 120.
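
One way to combine the two behaviours described above is to try repartition on read first and only fall back to `repartition` when the result has too few partitions. This is a minimal sketch, not an official Hail recipe: the helper name `ensure_partitions` is hypothetical, and it relies only on the `n_partitions()` and `repartition(n)` methods already used in this post.

```python
def ensure_partitions(ht, n_target):
    """Return a table with at least n_target partitions.

    `ht` is expected to behave like a Hail Table; only n_partitions()
    and repartition(n) are called. The helper name is hypothetical.
    """
    if ht.n_partitions() >= n_target:
        # Repartition on read already produced enough partitions.
        return ht
    # Otherwise fall back to the explicit (shuffling) repartition.
    return ht.repartition(n_target)


# Intended use with Hail (not executed here; `path` is a placeholder):
#   import hail as hl
#   ht = hl.read_table(path, _n_partitions=n_rows)
#   ht = ensure_partitions(ht, n_rows)
```

The fallback pays for a shuffle only when the cheaper read-time repartition falls short.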

Is there a reason why repartitioning on read only produces 20 partitions for this table? I’d appreciate any insight – thanks in advance!

Edit – here’s the hail log:
repartition_test.log (1.1 MB)

tpoterba · March 15, 2022, 6:35pm

You’ve only got 100 rows – many of the partitioning operations have a bit of noise/error when the number of rows is close to the number of partitions. In particular, repartitioning to a larger number of partitions than the number of rows shouldn’t be expected to produce that number of partitions (which includes empty partitions). I don’t think this is something to be super concerned about – is there a reason you need more than 20 partitions here?

ch-kr · March 15, 2022, 6:47pm

Thanks for responding! Yes, I’m trying to increase the number of partitions because I’m running hl.experimental.loop on each row of the table, and my basic understanding is that having a larger number of partitions will speed up the loop operation.
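
To put rough numbers on that intuition: if rows were spread evenly and each partition were processed as one task, the rows handled by the busiest task would be the ceiling of rows over partitions. The sketch below is plain arithmetic under those assumptions (even distribution, one task per partition, enough cores to run the tasks in parallel); it does not model Hail's scheduler, and the function name is hypothetical.

```python
import math


def max_rows_per_partition(n_rows: int, n_partitions: int) -> int:
    """Rows in the largest partition, assuming an even split."""
    return math.ceil(n_rows / n_partitions)


# The two partition counts observed in this thread, with 120 rows.
print(max_rows_per_partition(120, 20))   # 6 rows per task
print(max_rows_per_partition(120, 120))  # 1 row per task
```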
