Hi community,
I wonder if there is a BlockMatrix file format specification? Something like the PGEN spec?
I downloaded a bunch of BlockMatrices from the gnomAD LD panel, which seem to be binary-compressed, and I want to write a parser in Julia that can read the (i,j)th entry like bm[i,j]. I read the source code but I can’t seem to find the info I need.
Any help/suggestion is appreciated. Thank you.
After some experimentation, in Python, accessing a single BlockMatrix entry (e.g. via bm[0,0]) takes ~0.3 seconds. Thus, building a 10 by 10 matrix takes almost 7 seconds. I don’t know if this performance is expected?
Here is a naive Julia parser that does essentially what I want. But it internally calls Python code, so the speed is the same. I’d like to be able to do something like bm[1:10000, 1:10000] reasonably fast (within a few seconds?), if this is possible.
It would be possible to write a reader in Python, but you won’t be able to randomly access elements – these files are compressed with LZ4.
We haven’t documented our file formats because they are subject to change and we don’t want external code depending on them, though in practice we haven’t changed our formats in a while (it may happen soon, though!).
These files are compressed in blocks. The easiest thing is to decompress each block first.
Each block is:
- four bytes comprising 32-bit int BLOCK_LEN, followed by four bytes comprising 32-bit int DECOMP_LEN, followed by (BLOCK_LEN-4) compressed bytes. These bytes are compressed using LZ4.
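As a rough illustration of the block framing just described, here is a minimal Python sketch that walks a byte buffer and decompresses each block. It makes several assumptions that are not stated above: the integers are taken to be little-endian, the function names `iter_blocks` and `decompress_stream` are made up for this example, and the LZ4 step is passed in as a callable so the sketch itself only needs the standard library. For real files you would supply an actual LZ4 block decompressor; `lz4.block.decompress` from the third-party `lz4` package (which takes an `uncompressed_size` argument) looks like a candidate, but I have not checked it against these files.

```python
import struct
from typing import Callable, Iterator

# Signature assumed for the decompressor: (compressed_bytes, decompressed_length) -> bytes
Decompressor = Callable[[bytes, int], bytes]


def iter_blocks(buf: bytes, decompress: Decompressor) -> Iterator[bytes]:
    """Yield the decompressed payload of each block in `buf`.

    Layout per block, as described in the post:
      4 bytes  BLOCK_LEN   (int32; covers DECOMP_LEN plus the compressed bytes)
      4 bytes  DECOMP_LEN  (int32)
      BLOCK_LEN - 4 compressed bytes
    Little-endian byte order is an assumption.
    """
    pos = 0
    while pos < len(buf):
        (block_len,) = struct.unpack_from("<i", buf, pos)
        (decomp_len,) = struct.unpack_from("<i", buf, pos + 4)
        start = pos + 8
        end = start + (block_len - 4)
        if block_len < 4 or end > len(buf):
            raise ValueError(f"truncated or corrupt block at offset {pos}")
        yield decompress(buf[start:end], decomp_len)
        pos = end


def decompress_stream(buf: bytes, decompress: Decompressor) -> bytes:
    """Concatenate all decompressed blocks into one bytes object."""
    return b"".join(iter_blocks(buf, decompress))
```

With the `lz4` package installed, the call might look like `decompress_stream(raw, lambda data, n: lz4.block.decompress(data, uncompressed_size=n))`, again as an untested assumption.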
Once you’ve decompressed and concatenated (virtually or physically) these blocks, the matrix block is specified by:
- four bytes comprising 32-bit int N_ROWS
- four bytes comprising 32-bit int N_COLS
- a byte denoting IS_TRANSPOSE (1 if transposed, 0 if not)
- N_ROWS * N_COLS float64 values (8 bytes each), row-major if IS_TRANSPOSE, column-major if not IS_TRANSPOSE.
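To make the header and storage order concrete, here is a small Python sketch that reads a single entry from the already-decompressed bytes of one matrix block. Assumptions: little-endian byte order, zero-based indices, and the helper names `read_header` and `entry`, none of which come from Hail itself.

```python
import struct

# N_ROWS (int32), N_COLS (int32), IS_TRANSPOSE (1 byte); "<" means
# little-endian with no padding, so the header is 9 bytes. Byte order is assumed.
HEADER = struct.Struct("<iiB")


def read_header(data: bytes) -> tuple[int, int, bool]:
    """Return (n_rows, n_cols, is_transpose) from the start of `data`."""
    n_rows, n_cols, is_transpose = HEADER.unpack_from(data, 0)
    return n_rows, n_cols, is_transpose != 0


def entry(data: bytes, i: int, j: int) -> float:
    """Return element (i, j), zero-based, of the matrix stored in `data`."""
    n_rows, n_cols, is_transpose = read_header(data)
    if not (0 <= i < n_rows and 0 <= j < n_cols):
        raise IndexError(f"({i}, {j}) out of bounds for {n_rows}x{n_cols}")
    if is_transpose:
        k = i * n_cols + j  # row-major: rows are contiguous
    else:
        k = j * n_rows + i  # column-major: columns are contiguous
    (value,) = struct.unpack_from("<d", data, HEADER.size + 8 * k)
    return value
```

Since the offset of each element is a simple function of (i, j), random access within a block is cheap once that block has been decompressed; the cost is in the decompression, so caching decompressed blocks is probably what matters for slices like bm[1:10000, 1:10000].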