Is there a way to control RAM usage? I’ve been testing Hail on a 56-core 512 GB RAM machine, and have noticed that Hail consistently uses only 95-105 GB of RAM, regardless of task or data. Is there a way to instruct Hail to use more RAM, and would there be any benefit to this?
You have nearly 10GB per core, far more than is typical. For example, standard Google Dataproc cores have 3.75GB, and even the high-memory ones only have 6.5GB. We’ve consciously written Hail to operate within these constraints, so the short answer is, no, you can’t take advantage of more RAM, and the RAM usage you’re seeing is about what we’d expect from 56 cores.
The longer answer is that there are a few situations where one implementation strategy is faster but more memory intensive than another. For example, when computing a kinship matrix X * X^T, where X is the samples-by-variants matrix of genotypes stored distributed by variant, one could use Spark's RowMatrix.computeGramianMatrix or convert to a BlockMatrix and use BlockMatrix multiplication. The former is faster but requires every core to store two copies of an n x n matrix of doubles (one copy accumulates into the other), where n is the number of samples; that's about 2 * 8 * n^2 bytes per core, which grows quickly with n. For computing kinship in linear mixed models (currently a pull request), I've chosen a cut-off of n = 3000 for switching from the former to the latter method, but I also include "advanced" options to force a particular implementation, and on your machine you may find it's possible and worthwhile to use computeGramianMatrix at larger sample sizes.
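To make the trade-off concrete, here's a minimal PySpark sketch of the two strategies. This is not Hail's internal code, just an illustration; the toy genotype values and the block size are placeholders:

```python
from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import RowMatrix, IndexedRowMatrix

sc = SparkContext.getOrCreate()

# Toy genotype data G: one row per variant, one column per sample (n = 4 here).
# G is the transpose of X, so the Gramian G^T * G equals the kinship X * X^T.
variants = [
    [0.0, 1.0, 2.0, 0.0],
    [1.0, 1.0, 0.0, 2.0],
    [2.0, 0.0, 1.0, 1.0],
]

# Strategy 1: local Gramian. Each task accumulates an n x n matrix of doubles,
# with two copies alive during the merge -- roughly 2 * 8 * n^2 bytes per core
# (n = 3,000 -> ~144 MB; n = 50,000 -> ~40 GB).
row_mat = RowMatrix(sc.parallelize(variants))
kinship_local = row_mat.computeGramianMatrix()  # n x n local DenseMatrix

# Strategy 2: distributed BlockMatrix multiply. Slower, but the n x n product
# is itself split into blocks, so no single core has to hold all of it.
indexed = sc.parallelize(enumerate(variants))   # (row index, vector) pairs
block_mat = IndexedRowMatrix(indexed).toBlockMatrix(1024, 1024)
kinship_distributed = block_mat.transpose().multiply(block_mat)  # n x n BlockMatrix
```

The arithmetic in the comments is what drives the cut-off: the first path wins as long as 2 * 8 * n^2 bytes fits comfortably in the per-core heap, and the second takes over once it doesn't.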