BlockMatrix specification?

biona001 · February 10, 2023, 2:03am

Hi community,

I wonder if there is a BlockMatrix file format specification? Something like the PGEN spec?

I downloaded a bunch of BlockMatrices from the gnomAD LD panel, which seem to be binary-compressed, and I want to write a parser in Julia that can read the (i,j)th entry like bm[i,j]. I read the source code but couldn’t find the info I need.

Any help/suggestion is appreciated. Thank you.

biona001 · February 10, 2023, 3:18am

After some experimentation in Python, accessing a single BlockMatrix entry (e.g. via bm[0,0]) takes ~0.3 seconds. Thus, building a 10 by 10 matrix takes almost 7 seconds. Is this performance expected?

Here is a naive Julia parser that does essentially what I want. But it internally calls Python code, so the speed is the same. I’d like to be able to do something like bm[1:10000, 1:10000] reasonably fast (within a few seconds?), if this is possible.

tpoterba · February 10, 2023, 3:19am

It would be possible to write a reader in Python, but you won’t be able to randomly access elements – these files are compressed with LZ4.

We haven’t documented our file formats because they are subject to change and we don’t want external code depending on them. In practice we haven’t changed our formats in a while, though that may happen soon!

These files are compressed in blocks. The easiest thing is to decompress each block first.

Each block is:

four bytes comprising 32-bit int BLOCK_LEN, followed by four bytes comprising 32-bit int DECOMP_LEN, followed by (BLOCK_LEN-4) compressed bytes. These bytes are compressed using LZ4.
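A minimal Python sketch of that block framing, for anyone porting it to another language. It is not Hail's code. It assumes the two ints are little-endian and that the file is nothing but back-to-back blocks; neither is stated above, so check both against a real file. The function names are hypothetical, and the LZ4 step is passed in as a `decompress(payload, decomp_len)` callable so the framing logic can be tested on its own.

```python
import struct
from typing import Callable, Iterator


def iter_decompressed_blocks(
    raw: bytes,
    decompress: Callable[[bytes, int], bytes],
    byteorder: str = "<",  # ASSUMPTION: little-endian; pass ">" if the file is big-endian
) -> Iterator[bytes]:
    """Yield the decompressed payload of each block in `raw`."""
    offset = 0
    while offset < len(raw):
        # BLOCK_LEN and DECOMP_LEN: two consecutive 32-bit ints.
        block_len, decomp_len = struct.unpack_from(byteorder + "ii", raw, offset)
        offset += 8
        # BLOCK_LEN counts the 4 bytes of DECOMP_LEN, so the payload is BLOCK_LEN - 4 bytes.
        payload_len = block_len - 4
        payload = raw[offset:offset + payload_len]
        if payload_len < 0 or len(payload) != payload_len:
            raise ValueError("truncated or malformed block")
        offset += payload_len
        out = decompress(payload, decomp_len)
        if len(out) != decomp_len:
            raise ValueError("decompressed size does not match DECOMP_LEN")
        yield out


def decompress_all(raw: bytes, decompress: Callable[[bytes, int], bytes],
                   byteorder: str = "<") -> bytes:
    """Concatenate all decompressed blocks into one byte string."""
    return b"".join(iter_decompressed_blocks(raw, decompress, byteorder))
```

With the third-party `lz4` package, something like `lambda data, n: lz4.block.decompress(data, uncompressed_size=n)` could be passed as `decompress`. Whether that matches the exact LZ4 variant these files use is untested here.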

Once you’ve decompressed and concatenated (virtually or physically) these blocks, the matrix block is specified by:

four bytes comprising 32-bit int N_ROWS
four bytes comprising 32-bit int N_COLS
a byte denoting IS_TRANSPOSE (1 if transposed, 0 if not)
N_ROWS * N_COLS float64 values (8 bytes each), row-major if IS_TRANSPOSE, column-major if not IS_TRANSPOSE.
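
The layout above can be read with Python's `struct` module, roughly as sketched below. Assumptions that are mine and not from the description: values are little-endian, N_ROWS and N_COLS are the logical dimensions even when IS_TRANSPOSE is set, and indices are 0-based. `parse_matrix_block` and `entry` are hypothetical names.

```python
import struct
from typing import Tuple

MatrixBlock = Tuple[int, int, bool, Tuple[float, ...]]


def parse_matrix_block(buf: bytes, byteorder: str = "<") -> MatrixBlock:
    """Parse N_ROWS, N_COLS, IS_TRANSPOSE and the float64 payload.

    ASSUMPTION: little-endian by default; pass ">" for big-endian data.
    """
    header_fmt = byteorder + "iiB"  # two 32-bit ints and one byte, no padding
    n_rows, n_cols, is_transpose = struct.unpack_from(header_fmt, buf, 0)
    header_size = struct.calcsize(header_fmt)  # 9 bytes
    count = n_rows * n_cols
    if len(buf) < header_size + 8 * count:
        raise ValueError("buffer too short for the declared dimensions")
    values = struct.unpack_from(f"{byteorder}{count}d", buf, header_size)
    return n_rows, n_cols, bool(is_transpose), values


def entry(block: MatrixBlock, i: int, j: int) -> float:
    """Return element (i, j), 0-based, within a single block."""
    n_rows, n_cols, is_transpose, values = block
    if not (0 <= i < n_rows and 0 <= j < n_cols):
        raise IndexError("index out of range")
    # Row-major if IS_TRANSPOSE, column-major otherwise.
    # ASSUMPTION: N_ROWS/N_COLS describe the logical (untransposed) shape.
    index = i * n_cols + j if is_transpose else j * n_rows + i
    return values[index]
```

This only indexes within one block. Mapping a global bm[i,j] to the right block file and to local indices needs the block size from the BlockMatrix metadata, which is not covered here.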
