Thanks for the additional context!
So, you are encountering some edge conditions around partitioning. There are a few ways to specify partitioning:

- `hl.import_vcf(..., block_size)`
- `MatrixTable.repartition(n_partitions)`
- `MatrixTable.naive_coalesce(n_partitions)`
- Most construction functions take `n_partitions` or `_n_partitions`, e.g. `hl.balding_nichols_model` and `hl.utils.range_matrix_table`.
You always get the requested number of partitions unless that is impossible. For example, a MatrixTable is partitioned by rows: if you have ten rows and request twenty partitions, at least ten of those partitions (and as many as nineteen of them) must be empty. Empty partitions add overhead without any value, so Hail avoids creating them. For example, if you call `repartition(20)` on a MatrixTable with ten rows, Hail will give you ten partitions. The number of partitions p is roughly governed by:

p_{actual} = \textrm{min}(p_{requested}, n_{rows})
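That formula is trivial to state in code; the `actual_partitions` helper below is my own illustration, not a Hail API:

```python
def actual_partitions(requested: int, n_rows: int) -> int:
    """Hail drops empty partitions, and a MatrixTable is partitioned
    by rows, so you never get more partitions than rows."""
    return min(requested, n_rows)

# repartition(20) on a ten-row MatrixTable yields ten partitions:
print(actual_partitions(20, 10))
```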
With respect to `hl.import_vcf`, I suspect you’re using `force=True`. See the notes at `hl.import_vcf` about block-compressed gzip versus normal gzip. Hail cannot split a normal gzip file into multiple partitions, so the entire file is read serially as a single partition.
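If you’re unsure which kind of gzip you have, the BGZF (block gzip) format marks itself with a `BC` extra subfield in the gzip header, per the SAM/BGZF specification. This checker is my own sketch, not part of Hail:

```python
def is_bgzf(data: bytes) -> bool:
    """Heuristic: BGZF files are gzip files whose header sets the
    FEXTRA flag and carries a 'BC' extra subfield; plain gzip does not."""
    header = data[:18]
    if header[:2] != b"\x1f\x8b":   # not a gzip file at all
        return False
    if not header[3] & 0x04:        # FEXTRA flag unset -> plain gzip
        return False
    return header[12:14] == b"BC"   # BGZF subfield identifier

import gzip
print(is_bgzf(gzip.compress(b"chr1\t100\t...")))  # plain gzip
```

If this returns `False` for your `.vcf.gz`, recompressing with `bgzip` (from htslib) will let Hail split the file into many partitions.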
Hail’s computational model involves cores and partitions. Cores are literally computer cores. In general, the cores in a Hail cluster may be on different computers. A partition is, roughly, a subset of a dataset. The Hail driver effectively does this:
```python
cores = [...]
for partition in partitions:
    idle_core = wait_for_an_idle_core(cores)
    idle_core.process(partition)
```
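That loop can be simulated in plain Python. This is an illustration of the scheduling idea, not Hail’s actual scheduler; `schedule` and its round-robin idle-core policy are my invention (real scheduling depends on how long each partition takes):

```python
from collections import deque

def schedule(partitions, n_cores):
    """Assign each partition to the next idle core. With equal-cost
    partitions, 'next idle core' degenerates to round-robin."""
    cores = deque(range(n_cores))
    assignment = {}
    for partition in partitions:
        core = cores[0]      # wait_for_an_idle_core(cores)
        cores.rotate(-1)     # this core goes busy; it idles again later
        assignment.setdefault(core, []).append(partition)
    return assignment

# Eight partitions over four cores: each core processes two partitions.
print(schedule(range(8), n_cores=4))
```

This also shows why partitions beyond the core count still help: stragglers hand their remaining work to whichever core frees up first.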
If you’re using an elastic cloud compute cluster, more partitions permit the cluster to execute more efficiently (i.e. cheaply). If you’re using a single VM, laptop, or server and the only thing you care about is processing trivial pipelines (like GWAS) as fast as possible, then `partitions = cores` is best. This will cause issues when you perform more complex operations such as `group_rows_by` or linear algebraic operations such as `ld_matrix` or `genetic_relatedness_matrix`.
It looks like logistic regression’s runtime actually depends a lot more on the number of rows and columns than on the size of partitions (i.e. `block_size`), so more partitions (even when `block_size` is just 1 MiB) work a lot better for larger datasets!
Yeah, Hail is designed to have minimal per-partition overhead. Moreover, logistic regression is slow because it’s an iterative fitting procedure (unlike linear regression).
> […] I’ve noticed a significant speedup in my runtime by repartitioning data (I repartitioned my Hail MT from 50 to 3000)
This surprises me. How many cores are you using? Can you share your full pipeline and any Spark settings you’ve changed from the defaults? What’s your compute environment: laptop, VM, server, on-prem cluster, or cloud cluster?