Thanks for the additional context!
So, you are encountering some edge conditions around partitioning. There are a few ways to specify partitioning:

- `hl.import_vcf(..., block_size)`
- `MatrixTable.repartition(n_partitions)`
- `MatrixTable.naive_coalesce(n_partitions)`
- Most construction functions take `n_partitions` or `_n_partitions`, e.g. `hl.balding_nichols_model` and `hl.utils.range_matrix_table`.
You always get the requested number of partitions unless that is impossible. For example, a MatrixTable is partitioned by rows: if you have ten rows and request twenty partitions, at least ten of those partitions (and as many as nineteen of them) must be empty. Empty partitions add overhead without any value, so Hail avoids creating them. For example, if you call `repartition(20)` on a MatrixTable with ten rows, Hail will give you ten partitions. The number of partitions p is roughly governed by:

p_{actual} = \textrm{min}(p_{requested}, n_{rows})
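That formula is trivial to state in code; the `actual_partitions` helper below is my own illustration, not a Hail API:

```python
def actual_partitions(requested: int, n_rows: int) -> int:
    """Hail drops empty partitions, and a MatrixTable is partitioned
    by rows, so you never get more partitions than rows."""
    return min(requested, n_rows)

# repartition(20) on a ten-row MatrixTable yields ten partitions:
print(actual_partitions(20, 10))
```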
With respect to `hl.import_vcf`, I suspect you’re using `force=True`. See the notes at `hl.import_vcf` about block-compressed gzip versus normal gzip. Hail cannot split a normal gzip file into multiple partitions, so the entire file is read serially as a single partition.
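If you’re unsure which kind of gzip you have, the BGZF (block gzip) format marks itself with a `BC` extra subfield in the gzip header, per the SAM/BGZF specification. This checker is my own sketch, not part of Hail:

```python
def is_bgzf(data: bytes) -> bool:
    """Heuristic: BGZF files are gzip files whose header sets the
    FEXTRA flag and carries a 'BC' extra subfield; plain gzip does not."""
    header = data[:18]
    if header[:2] != b"\x1f\x8b":   # not a gzip file at all
        return False
    if not header[3] & 0x04:        # FEXTRA flag unset -> plain gzip
        return False
    return header[12:14] == b"BC"   # BGZF subfield identifier

import gzip
print(is_bgzf(gzip.compress(b"chr1\t100\t...")))  # plain gzip
```

If this returns `False` for your `.vcf.gz`, recompressing with `bgzip` (from htslib) will let Hail split the file into many partitions.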
Hail’s computational model involves cores and partitions. Cores are literally computer cores. In general, the cores in a Hail cluster may be on different computers. A partition is, roughly, a subset of a dataset. The Hail driver effectively does this:
```python
cores = [...]
for partition in partitions:
    idle_core = wait_for_an_idle_core(cores)
    idle_core.process(partition)
```
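That loop can be simulated in plain Python. This is an illustration of the scheduling idea, not Hail’s actual scheduler; `schedule` and its round-robin idle-core policy are my invention (real scheduling depends on how long each partition takes):

```python
from collections import deque

def schedule(partitions, n_cores):
    """Assign each partition to the next idle core. With equal-cost
    partitions, 'next idle core' degenerates to round-robin."""
    cores = deque(range(n_cores))
    assignment = {}
    for partition in partitions:
        core = cores[0]      # wait_for_an_idle_core(cores)
        cores.rotate(-1)     # this core goes busy; it idles again later
        assignment.setdefault(core, []).append(partition)
    return assignment

# Eight partitions over four cores: each core processes two partitions.
print(schedule(range(8), n_cores=4))
```

This also shows why partitions beyond the core count still help: stragglers hand their remaining work to whichever core frees up first.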
If you’re using an elastic cloud compute cluster, more partitions permit the cluster to execute more efficiently (i.e. cheaply). If you’re using a single VM, laptop, or server and the only thing you care about is processing trivial pipelines (like GWAS) as fast as possible, then `partitions = cores` is best. This will cause issues when you perform more complex operations such as `group_rows_by` or linear algebraic operations such as `ld_matrix` or `genetic_relatedness_matrix`.
It looks like logistic regression’s runtime actually depends a lot more on the number of rows and columns than on the size of partitions (i.e. `block_size`), so more partitions (even when `block_size` is just 1 MiB) work a lot better for larger datasets!
Yeah, Hail is designed to have minimal per-partition overhead. Moreover, logistic regression is slow because it’s an iterative fitting procedure (unlike linear regression).
> […] I’ve noticed a significant speedup in my runtime by repartitioning data (I repartitioned my Hail MT from 50 to 3000)
This surprises me. How many cores are you using? Can you share your full pipeline and any Spark settings you’ve changed from the defaults? What’s your compute environment: laptop, VM, server, on-prem cluster, or cloud cluster?