Hi hail team!
My understanding is that repartitioning a Table on read (hl.read_table(path, _n_partitions=n_partitions)) is better than running ht.repartition(n_partitions). However, I’ve noticed that repartitioning a Table on read isn’t increasing the number of partitions in an input Table to the desired number, whereas running ht.repartition() correctly increases the number of partitions.
For example, this is the schema of the input Table:
----------------------------------------
File Type: Table
Partitions: 4906
Rows: 100
Empty partitions: 4806
Min(rows/partition): 0
Max(rows/partition): 1
Median(rows/partition): 0.0
Mean(rows/partition): 0
StdDev(rows/partition): 0
----------------------------------------
Global fields:
'min_window_size': int32
----------------------------------------
Row fields:
'transcript': str
'total_oe': float64
'max_idx': int32
'transcript_start': int32
'transcript_end': int32
'cum_obs': array<int64>
'cum_exp': array<float64>
'positions': array<int32>
'list_len': int32
----------------------------------------
Key: ['transcript']
----------------------------------------
And this is the code I’m running:
import hail as hl

ht = hl.read_table('gs://gnomad-tmp/kc/test_100_over5k.ht')
split_window_size = 500
ht = ht.filter(ht.transcript == 'ENST00000217939')
# Build every (i_start, j_start) pair of window start indices
ht = ht.annotate(
    start_idx=hl.flatmap(
        lambda i: hl.map(
            lambda j: hl.struct(i_start=i, j_start=j),
            hl.range(0, ht.list_len, split_window_size),
        ),
        hl.range(0, ht.list_len, split_window_size),
    )
)
# One row per (i_start, j_start) pair
ht = ht.explode("start_idx")
ht = ht.annotate(i=ht.start_idx.i_start, j=ht.start_idx.j_start)
ht = ht._key_by_assert_sorted("transcript", "i", "j")
# Keep only pairs where j >= i
ht = ht.filter(ht.j >= ht.i)
ht = ht.annotate(
    i_max_idx=hl.min(ht.i + 500, ht.list_len - 1),
    j_max_idx=hl.min(ht.j + 500, ht.list_len - 1),
)
ht = ht.annotate(
    start_idx=ht.start_idx.annotate(
        j_start=hl.if_else(
            ht.start_idx.i_start == ht.start_idx.j_start,
            ht.start_idx.j_start + 1,
            ht.start_idx.j_start,
        ),
    ),
)
n_rows = ht.count()
print(n_rows)
The print statement here prints 120.
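As a rough sanity check on that count (plain Python, no Hail needed): if the filter leaves a single row for this transcript, then 120 rows after the j >= i filter corresponds to 15 window starts, since 15 * 16 / 2 = 120. That would put list_len somewhere between 7001 and 7500. The value 7200 below is only a hypothetical stand-in, and n_window_pairs is a helper name made up for this sketch, not something from Hail.

```python
def n_window_pairs(list_len: int, window: int = 500) -> int:
    """Count (i, j) window-start pairs with j >= i.

    Mirrors the flatmap / explode / filter steps above for a single row,
    using plain Python ranges in place of hl.range.
    """
    starts = range(0, list_len, window)
    return sum(1 for i in starts for j in starts if j >= i)


# Hypothetical list_len: the real value for this transcript is not shown here.
example_len = 7200
n_starts = len(range(0, example_len, 500))
print(n_starts, n_window_pairs(example_len))
```

So the row count itself looks right to me; the question is only about the number of partitions.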
Running
ht.write('gs://gnomad-tmp/kc/mxra5_rep_on_read_test.ht', overwrite=True)
ht = hl.read_table('gs://gnomad-tmp/kc/mxra5_rep_on_read_test.ht', _n_partitions=n_rows)
ht.n_partitions()
prints 20.
However, running
ht = ht.repartition(n_rows)
ht = ht.checkpoint('gs://gnomad-tmp/kc/mxra5_repart_test.ht', overwrite=True)
ht.n_partitions()
prints 120.
Is there a reason why repartitioning on read only gives this table 20 partitions instead of the requested 120? I’d appreciate any insight – thanks in advance!
Edit – here’s the hail log:
repartition_test.log (1.1 MB)