Hi,
I have a question regarding a strange result I got after densifying a matrix table. When comparing the file size of my densified and sparse matrix tables, the densified matrix table was smaller than the sparse one (12,975 MB for the densified matrix table and 13,273 MB for the sparse matrix table). This is the command I used to generate the densified matrix table:
import hail as hl
# sparse_matrix_table and mt_path are the input and output paths
mt = hl.read_matrix_table(sparse_matrix_table)
mt = hl.experimental.densify(mt)
mt.write(mt_path)
No repartitioning was done on either of the matrix tables, and they both have 2,586 partitions (20 samples with 167M variants).
Is there any reason as to why the sparse matrix table would be larger than the densified one? It seems counter-intuitive to me, but perhaps there’s some explanation I’m missing.
Thanks!
This is surprising to me. What is the entry schema of the matrix table? I could imagine this scenario happening with high frequency if the entry just has the call field and LA.
Confusingly, the entry schema has many fields (13).
Another idea: could you try just reading the sparse one with read_matrix_table and writing it to a new path with mt.write? There can sometimes be garbage files among the important ones.
I tested reading in the matrix table, then immediately writing it. While it did reduce in size, the difference was very small: 6.5 kB.
Just to add one extra piece of information: sparse_split_multi was run on the original matrix table during joint calling, though I’m not sure if this makes a difference.
Splitting multiallelics definitely increases the size, by a lot. I think we’d recommend doing this on the fly: it’s much more efficient to loop over an array of alt alleles in memory than to actually store the duplicated row for each alternate.
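To make the point about duplicated rows concrete, here is a minimal sketch in plain Python (not Hail). The toy rows and the helper names split_rows and iter_alts are made up for illustration. It only counts rows, whereas a real stored split would also duplicate the entry data for every sample.

```python
# Toy model (hypothetical data): each row is (locus, ref allele, list of alt alleles).
rows = [
    ("chr1:100", "A", ["T"]),
    ("chr1:200", "G", ["C", "T", "GA"]),
    ("chr1:300", "C", ["A", "G"]),
]


def split_rows(rows):
    """Materialise one biallelic row per alternate allele, as a stored split would."""
    return [(locus, ref, [alt]) for locus, ref, alts in rows for alt in alts]


def iter_alts(rows):
    """Visit every (locus, ref, alt) pair on the fly without storing extra rows."""
    for locus, ref, alts in rows:
        for alt in alts:
            yield locus, ref, alt


stored = split_rows(rows)       # what ends up on disk if the split is written out
visited = list(iter_alts(rows)) # same pairs, produced by looping in memory

print(len(rows), len(stored), len(visited))  # prints "3 6 6"
```

Both approaches reach the same six (locus, ref, alt) pairs, but only the stored split doubles the number of rows that have to be written.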
That’s really helpful, and good to know. Thanks @tpoterba !