Sparse mt entries question

I’m trying to explore the reference block intervals in a sparse mt with 10 samples, but I’m running into sporadic performance issues. For example, my first attempt to write out mt.entries() ran for over three hours without completing. The next attempt (the following day, on a new cluster) wrote the table within five minutes. Similarly, I just tried to export a subset of the entries table, and my first attempt hung for 10 minutes. When I ran the command again, it wrote a bgzipped TSV within 5 seconds.

Is there a better way to look at the reference blocks in a sparse mt? Has anyone else reported strange time differences in processing the same data recently?

You definitely do not want to use entries(). It converts the efficient MatrixTable format into an inefficient “coordinate” representation, where each entry of the matrix is represented as a triplet: the row key, the column key, and the entry.
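To make the "coordinate" representation concrete, here is a rough pure-Python sketch (not Hail code). The toy positions, sample names, and END values are made up, and the sketch says nothing about how Hail itself handles missing entries. It only shows that flattening to triplets repeats the row key and column key once per entry.

```python
# Toy stand-in for a matrix layout: row keys and column keys are stored once,
# and entries are stored as one array per row (None marks an absent entry).
# All values here are hypothetical.
row_keys = ["chr1:100", "chr1:150", "chr1:200"]
col_keys = ["s1", "s2", "s3", "s4"]
entries = [
    [{"END": 149}, None, {"END": 180}, None],
    [None, {"END": 210}, None, None],
    [{"END": 250}, None, None, {"END": 230}],
]


def to_coordinate(row_keys, col_keys, entries):
    """Flatten the matrix into (row key, column key, entry) triplets."""
    return [
        (rk, ck, entry)
        for rk, row in zip(row_keys, entries)
        for ck, entry in zip(col_keys, row)
    ]


triplets = to_coordinate(row_keys, col_keys, entries)

# The matrix layout stores each key once; the triplet layout stores
# one row key and one column key per entry.
keys_in_matrix = len(row_keys) + len(col_keys)
keys_in_triplets = 2 * len(triplets)
print(keys_in_matrix, keys_in_triplets)
```

With 3 rows and 4 columns the difference is small, but the number of keys in the triplet form grows with rows times columns, while the matrix form grows with rows plus columns.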

Can you give an example of an output format you would find useful?

Do you want per-sample statistics about reference block lengths?

Ahh, okay, thanks!

I want to check the overlap between the entries and our exome calling intervals – so I’m mostly just pulling the END field out of the entries and looking at that in combination with the locus. Should I use localize_entries instead?

Does mt.END.show() give you something reasonable?

Actually, I think I figured it out. I’m trying to compare the intervals covered in the sparse mt to our calling intervals (stored in a ht). I ended up using something like mt.annotate_rows(max_end=hl.agg.max(mt.END)), since annotate_rows needs a field name for the new annotation. Sorry for the confusing question, still wrapping my head around the sparse format. Thanks for the help!
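For anyone following along, the idea of taking the maximum END per row and comparing the resulting span to calling intervals can be sketched in plain Python. This is only an illustration under assumptions: the positions, END values, and intervals are hypothetical, everything is on a single contig, and intervals are treated as closed on both ends. In Hail the per-row maximum would come from a row aggregation such as hl.agg.max(mt.END) inside annotate_rows.

```python
# Hypothetical toy data: for each row position, the END values of the
# reference blocks that start there (one per sample that has a block).
blocks_by_pos = {
    100: [149, 180],
    150: [210],
    400: [450, 430],
}

# Hypothetical calling intervals on the same contig, closed [start, end].
calling_intervals = [(120, 160), (300, 350), (440, 500)]


def max_end_per_row(blocks_by_pos):
    """For each row position, keep the largest END across samples."""
    return {pos: max(ends) for pos, ends in blocks_by_pos.items()}


def overlaps(a_start, a_end, b_start, b_end):
    """True if closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def covered_intervals(blocks_by_pos, calling_intervals):
    """Return the calling intervals touched by at least one row's [pos, max END] span."""
    spans = max_end_per_row(blocks_by_pos)
    return [
        (start, end)
        for start, end in calling_intervals
        if any(overlaps(pos, max_end, start, end) for pos, max_end in spans.items())
    ]


max_ends = max_end_per_row(blocks_by_pos)
covered = covered_intervals(blocks_by_pos, calling_intervals)
print(max_ends)
print(covered)
```

The nested loop is fine for a toy example; for real data an interval join or sorted sweep would be the more sensible choice.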

We all are :wink: