Hi Hail Team,
I’m using realized_relationship_matrix() to generate a pre-computed kinship matrix for use with linear_mixed_model() and have found that it’s quite slow. Under the hood, I see that it is only using a single process, which I imagine is a factor. Do you have any suggestions for improving performance? I assume more stringent LD pruning before running it would mean fewer variants going into the function, and would improve performance? Is this something you’d suggest doing? Are there any guidelines for the number of variants that is optimal to use? Finally, is it somehow faster if the genotypes are provided directly to linear_mixed_model() with the z_t argument vs pre-computing? Sorry for the flood of questions, and thanks in advance for any help.
OK, a couple of things. First, Hail’s linear mixed model is harder to use and certainly less efficient (though more scalable) than options like BOLT-LMM and SAIGE. Most Hail users at the Broad use Hail to slice/dice/QC their data, but export and run BOLT or SAIGE for analysis if using LMMs. This is something to consider.
Are you running Hail with Spark in local mode (single machine, laptop or server) or on a Spark cluster? This may help us understand what’s going on a bit better. Debugging performance in this situation can be a bit tricky, for sure, and there are a couple explanations for why you’d be losing parallelism. I’d recommend taking your ld-pruned data, writing it, then reading it back before computing the RRM. Also, can you print
mt.n_partitions() and mt.count()` and post those here?
Thanks for the reply, and for flagging the limitations of Hail’s LMM. We are using it exactly because it’s easier to scale (as you mentioned), and it’s more convenient to include in our purely python-driven pipeline. But I take your point that we may want to consider using Hail just for QC and using BOLT/SAIGE downstream.
Regarding realized_relationship_matrix performance, we are running Hail on a Spark cluster, and have no issues with parallelization for any functions other than realized_relationship_matrix. In this particular case, 128 cores are available, the number of partitions is 395, and the number of variants post-LD pruning is 5,541,184. The workflow as currently implemented is to:
- Load the MT
- Perform ld_pruning
- Run realized_relationship_matrix
I tried saving an intermediate filtered MT after performing LD pruning but it did not make a difference (1 process was still used). I’ve attached a screenshot of the Spark UI job history, showing parallelization up until RRM runs.
I ran into the above issue with a test case of ~ 1,500 individuals, however when working with another data set of ~ 30,000 individuals with the same pipeline, I found that the number of processes used by realized_relationship_matrix was 49, so at least it is scaling up with more individuals, just not in a way I understand.
With the issue of scaling, I wonder if @danking has thoughts on this? He could perhaps shed some light !
By default Hail uses a block size of 4096. The RRM is an n_samples x n_samples matrix, represented in blocks of 4096x4096. For 1500 individuals that’s one block. The issue you’re likely facing is that to compute the 1500x1500 matrix, you have to take the inner product of a 1500x5541184 matrix with a 5541184x1500 matrix. We’re hard at work on converting this operation into a tree-aggregate-like operation and I expect improvements in a few months that will take advantage of parallelism in the inner-product step.
That said, I don’t think you need 5.5M variants to get a good estimate of the sample-sample correlation. Just yesterday, I ran kinship inference (a different but related problem to the RRM) on 16k samples and I only used ~60k SNVs. 5.5M seems like an overkill by about a factor of 100.
Thanks Dan, this answers my question. I’ll use fewer variants to speed up the process, and understand about the block size.