Hashing of MatrixTable


I want to create a hash that represents the current state of a MatrixTable (which rows are present, which cols are present, etc.). Is there an easy way to do this? The intended use is to avoid rerunning analyses that have already been run and whose results are already stored.

Basically, I want to create a hash associated with each analysis and use it as the filename of the results table that I output. When someone else wants to run an experiment, I will first check whether the current MatrixTable's hash matches that of a MatrixTable created in the past. If the two hashes are identical, the results will be identical as well, so the stored results can be reused.


You could try to hash the underlying intermediate representation (._tir for a Table or ._mir for a MatrixTable), but those are private attributes and are likely to change when the Hail version changes.

I think you’ll have a much better time tracking what you know about the Table / MatrixTable and using that information as a cache key.
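As a rough illustration of that approach, here is a minimal sketch that builds a cache key from things you already know about an analysis. Nothing here is a Hail API: the function name `analysis_cache_key` and its arguments (an analysis name, the input paths, a parameter dict, and a Hail version string that you pass in yourself) are assumptions made for the example.

```python
import hashlib
import json


def analysis_cache_key(analysis_name, input_paths, params, hail_version):
    """Build a deterministic key from what is known about an analysis.

    Hypothetical helper, not part of Hail. All arguments are plain
    Python values supplied by the caller.
    """
    payload = {
        "analysis": analysis_name,
        # Sort so that the order in which inputs are listed does not matter.
        "inputs": sorted(input_paths),
        "params": params,
        # Including the version means a Hail upgrade invalidates old keys.
        "hail_version": hail_version,
    }
    # sort_keys gives a stable serialization for dicts with the same content.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
```

The key only changes when one of the listed facts changes, so it is only as good as the facts you put in. If an input file is overwritten in place under the same path, this sketch will not notice.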

In general, this kind of caching is really tricky to get right in a general-purpose way. Hail Tables and MatrixTables are recipes, not realized data. The “rows [that] are present” aren’t known until Hail executes a collect or a write, for example. If you know that the inputs haven’t changed and the intermediate representation hasn’t changed, then I think you should get the same results.
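The lookup side of such a cache might look like the sketch below, assuming you already have some cache key string, however you derived it. The helper `run_if_not_cached`, the `.ht` suffix, and the `compute` callback are all hypothetical names for this example. The existence check defaults to the local filesystem; for cloud paths you would likely pass a different check (Hail's `hl.hadoop_exists` is one candidate).

```python
import os


def run_if_not_cached(cache_key, results_dir, compute, exists=os.path.exists):
    """Run `compute(out_path)` only if no result is stored under this key.

    Hypothetical helper, not part of Hail. `compute` is a caller-supplied
    function that writes its result to `out_path` (for example by calling
    `.write(out_path)` on a Table). Returns (out_path, ran), where `ran`
    says whether the analysis was actually executed.
    """
    out_path = os.path.join(results_dir, cache_key + ".ht")
    if exists(out_path):
        # A result with this key already exists, so reuse it.
        return out_path, False
    compute(out_path)
    return out_path, True
```

Note that this trusts the key completely: if the key fails to capture something that affects the result, a stale table will be reused silently.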