Sliding window approach to Spearman's rank


I’d like to run a Spearman’s rank correlation in Hail comparing gene expression information with genotype information. I know this is quite tricky given that the Spearman’s correlation needs to rank the data, and therefore is not easily parallelisable. For my case, one potential solution to this problem is using a sliding window approach, where I use a sliding window of 1Mb around each gene, then launch a job for each window. Is this sliding window approach something that can be done in Hail? And if so, are there any recommendations on the best way to do this?


Hi Katalina,

Interesting problem! Hail doesn’t have very good support for windowed operations currently, but it’s definitely on our roadmap. It might be possible to get something working in a hacky way, but it would likely be tricky.

There’s another approach that I think should work. We have an aggregator called approx_cdf (see the docs) that computes a small sketch of the distribution of a field. From that sketch, you can estimate the median or other quantiles of the field, and you can also go the other way and estimate the rank of a given value. It doesn’t look like we ever added a function for going from value to rank, but I can write that and share it with you.

So in this approach you would first compute the approximate CDFs of the two fields in one aggregation pass. Then, in a second pass, you would use those CDFs to estimate the rank of each value, and compute the ordinary (Pearson) correlation of the estimated ranks, which is the Spearman rank correlation. Do you think that would work for your use case?
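To make the two passes concrete, here is a minimal sketch in plain Python rather than Hail. Everything in it is illustrative: build_sketch is a hypothetical stand-in for the approx_cdf pass (it is exact, not approximate), the (values, ranks) layout with one extra entry in ranks is an assumption modelled on the kind of sketch described above, and estimate_rank is only a guess at what a value-to-rank helper could look like, not an existing Hail function.

```python
from bisect import bisect_left, bisect_right
from math import sqrt


def build_sketch(xs):
    """Pass 1 (stand-in for an approximate CDF sketch).

    Returns (values, ranks) where values are the sorted distinct observations
    and ranks[i] is the number of observations strictly less than values[i].
    ranks has one extra trailing entry equal to the total count.
    """
    sorted_xs = sorted(xs)
    values = sorted(set(xs))
    ranks = [bisect_left(sorted_xs, v) for v in values] + [len(xs)]
    return values, ranks


def estimate_rank(sketch, x):
    """Pass 2 helper: estimate the (tie-averaged) rank of x from the sketch."""
    values, ranks = sketch
    below = ranks[bisect_left(values, x)]      # observations < x
    at_most = ranks[bisect_right(values, x)]   # observations <= x
    # Tied observations occupy ranks below+1 .. at_most; return their average.
    return (below + at_most + 1) / 2


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman_two_pass(xs, ys):
    """Spearman correlation as Pearson correlation of sketch-derived ranks."""
    sketch_x, sketch_y = build_sketch(xs), build_sketch(ys)   # pass 1
    rank_x = [estimate_rank(sketch_x, x) for x in xs]         # pass 2
    rank_y = [estimate_rank(sketch_y, y) for y in ys]
    return pearson(rank_x, rank_y)


# Made-up toy data: genotype dosages and expression values for six samples.
genotype = [0, 1, 2, 1, 0, 2]
expression = [1.2, 2.5, 3.9, 2.1, 0.7, 4.4]
rho = spearman_two_pass(genotype, expression)
```

With a real approximate sketch the ranks would only be estimates, so the resulting correlation would be approximate too; how close it gets depends on the accuracy of the sketch.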

Could you elaborate a bit on what you want to do? Are the gene expression and genotype data entry fields? And if so, are you computing a single global correlation, or per row or column?