Realized_relationship_matrix performance

stecast · July 24, 2020, 12:36am

Hi Hail Team,
I’m using realized_relationship_matrix() to generate a pre-computed kinship matrix for use with linear_mixed_model() and have found that it’s quite slow. Under the hood, I see that it is only using a single process, which I imagine is a factor. Do you have any suggestions for improving performance? I assume more stringent LD pruning before running it would mean fewer variants going into the function, and would improve performance? Is this something you’d suggest doing? Are there any guidelines for the number of variants that is optimal to use? Finally, is it somehow faster if the genotypes are provided directly to linear_mixed_model() with the z_t argument vs pre-computing? Sorry for the flood of questions, and thanks in advance for any help.

Best,
Stephane

tpoterba · July 24, 2020, 8:53pm

OK, a couple of things. First, Hail’s linear mixed model is harder to use and certainly less efficient (though more scalable) than options like BOLT-LMM and SAIGE. Most Hail users at the Broad use Hail to slice/dice/QC their data, but export and run BOLT or SAIGE for analysis if using LMMs. This is something to consider.

Are you running Hail with Spark in local mode (single machine, laptop or server) or on a Spark cluster? This may help us understand what’s going on a bit better. Debugging performance in this situation can be a bit tricky, for sure, and there are a couple of explanations for why you’d be losing parallelism. I’d recommend taking your LD-pruned data, writing it, then reading it back before computing the RRM. Also, can you print mt.n_partitions() and mt.count() and post those here?
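For readers following along, the write/read-back step and the two diagnostics requested above could be sketched roughly as follows. This is only an illustration: the helper name `checkpoint_and_report` and the output path are hypothetical, and it assumes `mt` is a Hail MatrixTable, whose `checkpoint` method writes the data and returns the table read back from disk.

```python
def checkpoint_and_report(mt, path):
    """Write `mt` to `path`, read it back, and print the diagnostics asked for above.

    `mt` is assumed to be a Hail MatrixTable (for example, the LD-pruned data);
    `path` is a hypothetical output location.
    """
    # checkpoint() writes the table and returns it read back from disk,
    # so later steps start from stored data rather than the upstream pipeline.
    mt = mt.checkpoint(path, overwrite=True)

    # count() on a MatrixTable returns (number of rows, number of columns),
    # i.e. (variants, samples).
    n_variants, n_samples = mt.count()
    print(f"partitions: {mt.n_partitions()}")
    print(f"variants: {n_variants}, samples: {n_samples}")
    return mt
```

The returned table is the one to pass on to the RRM computation, so that the pruning work is not repeated.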

stecast · July 27, 2020, 7:33pm

Thanks for the reply, and for flagging the limitations of Hail’s LMM. We are using it exactly because it’s easier to scale (as you mentioned), and it’s more convenient to include in our purely python-driven pipeline. But I take your point that we may want to consider using Hail just for QC and using BOLT/SAIGE downstream.

Regarding realized_relationship_matrix performance, we are running Hail on a Spark cluster, and have no issues with parallelization for any functions other than realized_relationship_matrix. In this particular case, 128 cores are available, the number of partitions is 395, and the number of variants post-LD pruning is 5,541,184. The workflow as currently implemented is to:

1. Load the MT
2. Perform ld_pruning
3. Run realized_relationship_matrix

I tried saving an intermediate filtered MT after performing LD pruning but it did not make a difference (1 process was still used). I’ve attached a screenshot of the Spark UI job history, showing parallelization up until RRM runs.

I ran into the above issue with a test case of ~1,500 individuals. However, when working with another data set of ~30,000 individuals with the same pipeline, I found that the number of processes used by realized_relationship_matrix was 49, so at least it is scaling up with more individuals, just not in a way I understand.

kumarveerapen · July 30, 2020, 3:32pm

On the scaling issue, I wonder if @danking has thoughts? He could perhaps shed some light!

danking · July 30, 2020, 4:00pm

By default Hail uses a block size of 4096. The RRM is an n_samples x n_samples matrix, represented in blocks of 4096x4096. For 1500 individuals that’s one block. The issue you’re likely facing is that to compute the 1500x1500 matrix, you have to take the inner product of a 1500x5541184 matrix with a 5541184x1500 matrix. We’re hard at work on converting this operation into a tree-aggregate-like operation and I expect improvements in a few months that will take advantage of parallelism in the inner-product step.
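The block arithmetic above can be checked with a few lines of plain Python. This is a sketch, not Hail's implementation: the helper `rrm_block_grid` is hypothetical, and the idea that the number of parallel tasks equals the number of output blocks is an assumption the thread does not confirm.

```python
import math

BLOCK_SIZE = 4096  # default block size quoted in the post above


def rrm_block_grid(n_samples, block_size=BLOCK_SIZE):
    """Return (blocks per side, total blocks) for an n_samples x n_samples matrix."""
    per_side = math.ceil(n_samples / block_size)
    return per_side, per_side * per_side


# 1,500 is the test case from the thread; 28,000 and 30,000 are illustrative values.
for n in (1_500, 28_000, 30_000):
    per_side, total = rrm_block_grid(n)
    print(f"{n} samples: {per_side} x {per_side} grid, {total} blocks")
```

With 1,500 samples the whole matrix fits in one block, which matches the single process observed. A 7 x 7 grid gives 49 blocks and corresponds to between 24,577 and 28,672 samples, so the 49 processes reported for the "~30,000" data set would be consistent with one task per block if the exact sample count falls in that range.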

That said, I don’t think you need 5.5M variants to get a good estimate of the sample-sample correlation. Just yesterday, I ran kinship inference (a different but related problem to the RRM) on 16k samples and I only used ~60k SNVs. 5.5M seems like overkill by about a factor of 100.
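As a rough worked example of the numbers in this post, the fraction of variants to keep in order to go from 5,541,184 to about 60,000 can be computed as below. The helper `downsample_fraction` is hypothetical; random thinning is only one way to reduce the variant count (stricter LD pruning is another), and passing the fraction to something like Hail's `mt.sample_rows(p)` is an assumption about how one might apply it, not advice from the thread.

```python
def downsample_fraction(n_variants, target):
    """Fraction of variants to keep so that roughly `target` remain."""
    if n_variants <= 0:
        raise ValueError("n_variants must be positive")
    # Never ask for more than everything.
    return min(1.0, target / n_variants)


# Variant count from the thread, and the ~60k SNVs mentioned above.
p = downsample_fraction(5_541_184, 60_000)
print(f"keep fraction: {p:.4f}")          # keep fraction: 0.0108
print(f"reduction factor: {1 / p:.0f}x")  # reduction factor: 92x
```

The reduction factor of about 92 is in line with the "factor of 100" estimate in the post.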

stecast · July 31, 2020, 4:20pm

Thanks Dan, this answers my question. I’ll use fewer variants to speed up the process, and I now understand the role of the block size.
