Repartition vs repartition on read

Hi Hail team!

My understanding is that repartitioning a Table on read (hl.read_table(path, _n_partitions=n_partitions)) is preferable to running ht.repartition(n_partitions), since it re-plans partitions from the file’s index instead of shuffling the data. However, I’ve noticed that repartitioning on read isn’t increasing the number of partitions in my input Table to the desired number, while ht.repartition() does increase it correctly.
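For context, the two approaches I’m comparing look like this (a sketch with a placeholder path):

import hail as hl

# Approach 1: repartition on read. _n_partitions is a target; Hail re-plans
# the partitioning from the file's index without shuffling the data.
ht = hl.read_table('gs://bucket/example.ht', _n_partitions=100)

# Approach 2: explicit repartition. This shuffles the data and should land
# on the requested number of partitions.
ht = hl.read_table('gs://bucket/example.ht')
ht = ht.repartition(100)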

For example:
Here are the partition summary and schema of the input Table:

----------------------------------------
File Type: Table
    Partitions: 4906
    Rows: 100
    Empty partitions: 4806
    Min(rows/partition): 0
    Max(rows/partition): 1
    Median(rows/partition): 0.0
    Mean(rows/partition): 0
    StdDev(rows/partition): 0
----------------------------------------
Global fields:
    'min_window_size': int32
----------------------------------------
Row fields:
    'transcript': str
    'total_oe': float64
    'max_idx': int32
    'transcript_start': int32
    'transcript_end': int32
    'cum_obs': array<int64>
    'cum_exp': array<float64>
    'positions': array<int32>
    'list_len': int32
----------------------------------------
Key: ['transcript']
----------------------------------------

and this is the code I’m running:

import hail as hl

ht = hl.read_table('gs://gnomad-tmp/kc/test_100_over5k.ht')
split_window_size = 500

# Keep a single transcript, then build all (i, j) window start pairs.
ht = ht.filter(ht.transcript == 'ENST00000217939')
ht = ht.annotate(
    start_idx=hl.flatmap(
        lambda i: hl.map(
            lambda j: hl.struct(i_start=i, j_start=j),
            hl.range(0, ht.list_len, split_window_size),
        ),
        hl.range(0, ht.list_len, split_window_size),
    )
)

# One row per (i, j) pair.
ht = ht.explode("start_idx")
ht = ht.annotate(i=ht.start_idx.i_start, j=ht.start_idx.j_start)
ht = ht._key_by_assert_sorted("transcript", "i", "j")
ht = ht.filter(ht.j >= ht.i)

# Cap each window's end index at the end of the list.
ht = ht.annotate(
    i_max_idx=hl.min(ht.i + split_window_size, ht.list_len - 1),
    j_max_idx=hl.min(ht.j + split_window_size, ht.list_len - 1),
)
ht = ht.annotate(
    start_idx=ht.start_idx.annotate(
        j_start=hl.if_else(
            ht.start_idx.i_start == ht.start_idx.j_start,
            ht.start_idx.j_start + 1,
            ht.start_idx.j_start,
        ),
    ),
)
n_rows = ht.count()
print(n_rows)

The print statement here prints 120.

Running

ht.write('gs://gnomad-tmp/kc/mxra5_rep_on_read_test.ht', overwrite=True)
ht = hl.read_table('gs://gnomad-tmp/kc/mxra5_rep_on_read_test.ht', _n_partitions=n_rows)
ht.n_partitions()

returns 20.

However, running

ht = ht.repartition(n_rows)
ht = ht.checkpoint('gs://gnomad-tmp/kc/mxra5_repart_test.ht', overwrite=True)
ht.n_partitions()

returns 120.

Is there a reason why repartitioning the Table on read only produces 20 partitions here? I’d appreciate any insight. Thanks in advance!

Edit: here’s the Hail log:
repartition_test.log (1.1 MB)

You’ve only got 100 rows, and many of the partitioning operations have a bit of noise/error when the number of rows is close to the number of partitions. In particular, repartitioning to a larger number of partitions than there are rows shouldn’t be expected to produce exactly that many partitions, since hitting the target would require creating empty partitions. I don’t think this is something to be super concerned about. Is there a reason you need more than 20 partitions here?
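As a quick illustration (a toy table and local path; exact counts may vary):

import hail as hl

# Toy table: 100 rows in 4 partitions.
ht = hl.utils.range_table(100, n_partitions=4)
ht.write('/tmp/toy.ht', overwrite=True)

# Ask for more partitions than there are rows. The requested count is a
# target, so the actual count can come out lower.
ht2 = hl.read_table('/tmp/toy.ht', _n_partitions=120)
print(ht2.n_partitions())  # likely well under 120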

Thanks for responding! Yes, the reason I’m trying to increase the number of partitions is that I’m running hl.experimental.loop on each row of the table, and my basic understanding is that having a larger number of partitions will speed up the loop by spreading the rows across more cores.
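Roughly, the per-row loop has this shape (a simplified sketch with a made-up computation, not my actual code):

import hail as hl

ht = hl.utils.range_table(10, n_partitions=10)
ht = ht.annotate(xs=hl.range(0, ht.idx + 1))

# Sum ht.xs with an explicit loop; the loop state is (accumulator, index).
ht = ht.annotate(
    total=hl.experimental.loop(
        lambda recur, acc, i: hl.if_else(
            i >= hl.len(ht.xs),
            acc,                           # done: return the accumulator
            recur(acc + ht.xs[i], i + 1),  # otherwise, advance the loop
        ),
        hl.tint32,  # return type of the loop
        0,          # initial accumulator
        0,          # initial index
    )
)

Since the loop runs once per row, my thinking was that more partitions means more rows evaluated in parallel.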