Hello,
I’m trying to measure/estimate the sizes of objects in a Hail application. However, I’m not able to see cached/persisted hail.Table or hail.MatrixTable objects in the Spark UI Storage tab.
After following the directions for running Hail on a Spark cluster, I’m running Hail 0.2 on EMR 5.16.0 with Spark 2.3.1.
Based on the example in the linked document, I run the following:
In [1]: import hail as hl
In [2]: mt = hl.balding_nichols_model(3, 100, 100)
In [3]: mt.cache()
Out[3]: <hail.matrixtable.MatrixTable at 0x7fd46f0fbcc0>
In [4]: mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles()))
Out[4]: 1.0444
However, the ‘Storage’ tab of the Spark UI remains empty despite appropriate updates to the ‘Jobs’ tab.
Is this expected? Are there other suggested ways of measuring the memory sizes of Hail Tables and MatrixTables?
Thanks!
Hugh
cache has different semantics in Hail and Spark. In Hail, no operation mutates the underlying data or its storage behavior, whereas Spark’s cache does exactly that:
rdd.cache()      # Spark: mutates rdd in place (though it also returns self)
mt.cache()       # Hail: returns a cached MatrixTable but does not mutate mt
mt = mt.cache()  # proper usage in Hail
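Applied to the original session, a minimal sketch with the reassignment looks like this (same toy model as above; the aggregation just forces evaluation of the cached MatrixTable):

import hail as hl

mt = hl.balding_nichols_model(3, 100, 100)
mt = mt.cache()  # reassign: cache() returns a new, cached MatrixTable
mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles()))  # forces evaluation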
You can also combine the cache with the balding_nichols_model line.
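Concretely, using the same toy parameters as the original snippet, that would be:

mt = hl.balding_nichols_model(3, 100, 100).cache()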
I wouldn’t suggest caching everywhere willy-nilly, though – we have a decent optimizer, and an explicit cache checkpoints that optimization process. There are, of course, situations where caching will speed up repeated downstream usage by a large factor.
As an example, caching a matrix table straight from a read (mt = hl.read_matrix_table('///').cache()) actually slows down downstream computation on most systems, because the Spark caching system has a lot of overhead.
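For completeness, the kind of repeated downstream use where an explicit cache can pay off looks roughly like this (a sketch, not a benchmark; the two aggregations are just placeholders for whatever you run more than once over the same data):

import hail as hl

mt = hl.balding_nichols_model(3, 1000, 1000)
mt = mt.cache()  # materialized once, reused by each aggregation below

mean_alt = mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles()))
call_rate = mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))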
Thanks, @tpoterba, that sorts it out for me!