hail.MatrixTable and Spark UI Storage tab


#1

Hello,

I’m trying to measure/estimate the sizes of objects in a Hail application. However, I’m not able to see cached/persisted hail.Table or hail.MatrixTable objects in the Spark UI Storage tab.

After following the directions for running Hail on a Spark cluster, I’m running Hail 0.2 on EMR 5.16.0 with Spark 2.3.1.

Based on the example in the linked document, I run the following:
In [1]: import hail as hl

In [2]: mt = hl.balding_nichols_model(3, 100, 100)

In [3]: mt.cache()
Out[3]: <hail.matrixtable.MatrixTable at 0x7fd46f0fbcc0>

In [4]: mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles()))
Out[4]: 1.0444

However, the ‘Storage’ tab of the Spark UI remains empty, despite the expected updates appearing in the ‘Jobs’ tab.

Is this expected? Are there other suggested ways of measuring the memory sizes of hail Tables and MatrixTables?

Thanks!
Hugh


#2

cache has different semantics in Hail and Spark. In Hail, no operation mutates the object it is called on or changes the underlying data storage behavior, whereas Spark’s cache does exactly that:

rdd.cache()      # Spark: caches rdd in place, no assignment needed (though it returns self)
mt.cache()       # Hail: returns a cached matrix table, but does not mutate mt
mt = mt.cache()  # Hail: proper usage -- reassign the result

You can also chain the cache() call directly onto the balding_nichols_model line.
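The no-mutation semantics can be sketched in plain Python. `FakeMatrixTable` below is a toy stand-in invented for illustration, not a real Hail class:

```python
class FakeMatrixTable:
    """Toy stand-in for hail.MatrixTable (hypothetical, illustration only)."""
    def __init__(self, cached=False):
        self.is_cached = cached

    def cache(self):
        # Returns a NEW cached object; self is left untouched,
        # mirroring Hail's no-mutation semantics.
        return FakeMatrixTable(cached=True)

mt = FakeMatrixTable()
mt.cache()            # result discarded: mt is still uncached
assert not mt.is_cached

mt = mt.cache()       # reassignment is required, as in Hail
assert mt.is_cached
```

Spark’s RDD.cache, by contrast, flips the storage level on the same object, which is why the no-assignment form works there.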


#3

I wouldn’t suggest caching everywhere willy-nilly, though – we have a decent optimizer, and an explicit cache checkpoints the pipeline at that point, so the optimizer can’t work across it. There are, of course, situations where caching will speed up repeated downstream usages by a large factor.


#4

As an example, caching a matrix table from a read (mt = hl.read_matrix_table('///').cache()) actually slows down downstream computation on most systems, because the Spark caching system has a lot of overhead.


#5

Thanks, @tpoterba, that sorts it out for me!