Sparse mt entries question

ch-kr · November 7, 2019, 6:40pm

I’m trying to explore the reference block intervals in a sparse mt with 10 samples, but I’m having sporadic issues. For example, my first attempt to write out mt.entries() ran for over three hours without completing. The next attempt (the following day, on a new cluster) wrote the table within five minutes. Similarly, I just tried to export a subset of the entries table, and my first attempt was frozen for 10 minutes. When I started the command again, it wrote a bgzipped tsv within 5 seconds.

Is there a better way to look at the reference blocks in a sparse mt? Has anyone else reported strange time differences in processing the same data recently?

danking · November 7, 2019, 6:55pm

You definitely do not want to use entries(). It converts the efficient MatrixTable format into an inefficient “coordinate” representation in which each entry of the matrix is represented as a triplet: the row key, the column key, and the entry.
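To make the cost concrete, here is a toy sketch in plain Python (not Hail code) of what a coordinate representation does to a small matrix. The locus strings, sample IDs, and the `to_coordinates` helper are all made up for illustration; the point is only that every row key and column key gets repeated once per entry.

```python
# Toy illustration only (plain Python, not Hail). All names and values
# below are hypothetical.
row_keys = ["1:100", "1:200", "1:300"]   # made-up locus strings
col_keys = ["s1", "s2"]                  # made-up sample IDs

# Matrix form: keys are stored once, entries sit in a rows x columns grid.
matrix = [
    [{"END": 150}, {"END": 120}],
    [{"END": 250}, {"END": 230}],
    [{"END": 400}, {"END": 320}],
]

def to_coordinates(row_keys, col_keys, matrix):
    """Flatten the grid into (row key, column key, entry) triplets."""
    return [
        (rk, ck, entry)
        for rk, row in zip(row_keys, matrix)
        for ck, entry in zip(col_keys, row)
    ]

triplets = to_coordinates(row_keys, col_keys, matrix)

# Each row key now appears once per column instead of once overall.
row_key_copies = sum(1 for rk, _, _ in triplets if rk == "1:100")
```

With 3 rows and 2 columns this produces 6 triplets, and each row key is stored twice. The duplication grows with the number of samples.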

Can you give an example of an output format you would find useful?

Do you want per-sample statistics about reference block lengths?

ch-kr · November 7, 2019, 7:12pm

Ahh, okay, thanks!

I want to check the overlap between the entries and our exome calling intervals – so I’m mostly just pulling the END field out of the entries and looking at that in combination with the locus. Should I use localize_entries instead?

danking · November 7, 2019, 7:20pm

Does mt.END.show() give you something reasonable?

ch-kr · November 7, 2019, 8:21pm

Actually, I think I figured it out. I’m trying to compare the intervals covered in the sparse mt to our calling intervals (stored in a ht). I ended up using mt.annotate_rows(hl.agg.max(mt.END)). Sorry for the confusing question, I’m still wrapping my head around the sparse format. Thanks for the help!

danking · November 7, 2019, 8:23pm

We all are
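For readers following along, the approach ch-kr settled on (take the largest END per row, then compare the resulting spans to the calling intervals) can be sketched in plain Python. This is not Hail code and not necessarily how the comparison was actually done; the positions, END values, calling intervals, and helper names are all invented for illustration. In Hail itself, annotate_rows takes keyword arguments, so the call would presumably need a field name, for example mt.annotate_rows(max_end=hl.agg.max(mt.END)), where max_end is a made-up name.

```python
# Plain-Python sketch (not Hail). All data and names are hypothetical.

# One row per locus on a single contig: position -> END values of the
# reference blocks that start at that position (one per sample).
rows = {
    100: [150, 120],
    200: [250],
    300: [400, 320],
}

# Made-up calling intervals as inclusive (start, end) pairs.
calling_intervals = [(90, 130), (240, 260), (500, 600)]

def max_end_per_row(rows):
    """Mimic a per-row max aggregation: the farthest END at each position."""
    return {pos: max(ends) for pos, ends in rows.items()}

def overlaps(a_start, a_end, b_start, b_end):
    """True when two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end

def covered_intervals(rows, intervals):
    """Return the calling intervals touched by at least one row's span."""
    spans = list(max_end_per_row(rows).items())
    return [
        iv for iv in intervals
        if any(overlaps(start, end, iv[0], iv[1]) for start, end in spans)
    ]

row_spans = max_end_per_row(rows)
covered = covered_intervals(rows, calling_intervals)
```

Here the third calling interval lies beyond every row's span, so it is the only one not reported as covered.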
