Sparse matrix table file size is larger than densified matrix table

KatalinaBobowik · September 29, 2021, 1:48am

Hi,

I have a question regarding a strange result I got after densifying a matrix table. When comparing the file size of my densified and sparse matrix tables, the densified matrix table was smaller than the sparse one (12,975MB for the densified matrix table and 13,273MB for the sparse matrix table). This is the command I used to generate the densified matrix table:

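import hail as hl

# sparse_matrix_table and mt_path are path strings defined elsewhere
mt = hl.read_matrix_table(sparse_matrix_table)
# fill in the reference blocks so every entry is present
mt = hl.experimental.densify(mt)
mt.write(mt_path)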

No repartitioning was done on either of the matrix tables, and they both have 2,586 partitions (20 samples with 167M variants).

Is there any reason as to why the sparse matrix table would be larger than the densified one? It seems counter-intuitive to me, but perhaps there’s some explanation I’m missing.

Thanks!

tpoterba · September 29, 2021, 11:55am

This is surprising to me. What is the entry schema of the matrix table? I could imagine this scenario happening with high frequency if the entry just has the call field and LA.

KatalinaBobowik · September 30, 2021, 8:59am

Confusingly, the entry schema has many fields (13).

tpoterba · September 30, 2021, 12:52pm

another idea – could you try just read_matrix_table/write_matrix_table to a new path for the sparse one? there can sometimes be garbage files among the important ones.
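For anyone repeating this check, here is a minimal sketch of how the on-disk sizes could be compared once both tables have been written. It assumes the matrix tables are directories on a local filesystem (on cloud storage you would use your provider's own tooling instead), and `dir_size_bytes` and `size_report` are hypothetical helper names, not part of Hail.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so the same data is not counted twice.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def size_report(paths):
    """Return {path: size in MB, using 1 MB = 10**6 bytes} for each directory."""
    return {p: dir_size_bytes(p) / 1e6 for p in paths}
```

Calling `size_report` on the original sparse path, the rewritten sparse path, and the densified path gives three numbers measured the same way, so any stray files in the original directory show up as a difference between the first two.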

KatalinaBobowik · October 1, 2021, 8:55am

I tested reading in the matrix table, then immediately writing it. While it did reduce in size, the difference was very small - 6.5kB.

Just to add one extra piece of information: sparse_split_multi was run on the original matrix table during joint calling; however, I’m not sure whether this makes a difference.

tpoterba · October 1, 2021, 3:46pm

splitting multiallelics definitely increases the size, by a lot. I think we’d recommend doing this on the fly – it’s much more efficient to loop over an array of alt alleles in memory than to actually store the duplicated row for each alternate.
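To illustrate the storage point, here is a toy sketch in plain Python. It is not Hail code: the row layout (a dict with `locus`, `alleles`, and `entries`) and both function names are assumptions made up for the example, and it simply copies entries rather than downcoding them the way a real split would.

```python
def split_stored(rows):
    """Materialize one output row per alternate allele, as if written to disk."""
    out = []
    for row in rows:
        ref, alts = row["alleles"][0], row["alleles"][1:]
        for i, alt in enumerate(alts, start=1):
            out.append({
                "locus": row["locus"],
                "alleles": [ref, alt],
                "a_index": i,
                # Every sample's entry is copied into each split row.
                "entries": list(row["entries"]),
            })
    return out


def split_on_the_fly(rows):
    """Yield (locus, ref, alt, a_index, entries) per alt allele without copying."""
    for row in rows:
        ref = row["alleles"][0]
        for i, alt in enumerate(row["alleles"][1:], start=1):
            yield row["locus"], ref, alt, i, row["entries"]


# Two toy sites with three samples each: one triallelic-plus, one biallelic.
rows = [
    {"locus": "chr1:100", "alleles": ["A", "C", "G", "T"], "entries": ["0/1", "0/2", "0/3"]},
    {"locus": "chr1:200", "alleles": ["G", "A"], "entries": ["0/0", "0/1", "1/1"]},
]

stored_entries_unsplit = sum(len(r["entries"]) for r in rows)
stored_entries_split = sum(len(r["entries"]) for r in split_stored(rows))
```

In this toy case the unsplit data holds 6 entries while the stored split holds 12, because the site with three alternate alleles is written three times. The generator version visits the same four (site, alt) pairs but keeps only the original rows in memory.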

KatalinaBobowik · October 5, 2021, 11:20pm

That’s really helpful, and good to know. Thanks @tpoterba !
