Hello,
I’m trying to measure/estimate the sizes of objects in a Hail application. However, I’m not able to see cached/persisted hail.Table or hail.MatrixTable objects in the Spark UI Storage tab.
After following the directions for running Hail on a Spark cluster, I’m running Hail 0.2 on EMR 5.16.0 with Spark 2.3.1.
Based on the example in the linked document, I run the following:
In [1]: import hail as hl
In [2]: mt = hl.balding_nichols_model(3, 100, 100)
In [3]: mt.cache()
Out[3]: <hail.matrixtable.MatrixTable at 0x7fd46f0fbcc0>
In [4]: mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles()))
Out[4]: 1.0444
However, the ‘Storage’ tab of the Spark UI remains empty despite appropriate updates to the ‘Jobs’ tab.
Is this expected? Are there other suggested ways of measuring the memory sizes of Hail Tables and MatrixTables?
Thanks!
Hugh
cache has different semantics in Hail and Spark. In Hail, no operation mutates the underlying data or its storage behavior, whereas Spark’s cache does exactly that:
rdd.cache()      # Spark: mutates rdd in place (though it also returns self)
mt.cache()       # Hail: returns a cached MatrixTable but does not mutate mt
mt = mt.cache()  # proper usage in Hail
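Applied to the original session, a minimal sketch with the reassignment looks like this (same toy model as above; the aggregation just forces evaluation of the cached MatrixTable):

import hail as hl

mt = hl.balding_nichols_model(3, 100, 100)
mt = mt.cache()  # reassign: cache() returns a new, cached MatrixTable
mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles()))  # forces evaluation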
You can also combine the cache with the balding_nichols_model line.
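Concretely, using the same toy parameters as the original snippet, that would be:

mt = hl.balding_nichols_model(3, 100, 100).cache()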
I wouldn’t suggest caching everywhere willy-nilly, though – we have a decent optimizer, and an explicit cache checkpoints that optimization process. There are, of course, situations where caching will speed up repeated downstream usage by a large factor.
As an example, caching a matrix table straight from a read (mt = hl.read_matrix_table('///').cache()) actually slows down downstream computation on most systems, because the Spark caching system has a lot of overhead.
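For completeness, the kind of repeated downstream use where an explicit cache can pay off looks roughly like this (a sketch, not a benchmark; the two aggregations are just placeholders for whatever you run more than once over the same data):

import hail as hl

mt = hl.balding_nichols_model(3, 1000, 1000)
mt = mt.cache()  # materialized once, reused by each aggregation below

mean_alt = mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles()))
call_rate = mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))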
Thanks, @tpoterba, that sorts it out for me!