I’d like to run a Spearman’s rank correlation in Hail comparing gene expression information with genotype information. I know this is quite tricky given that the Spearman’s correlation needs to rank the data, and therefore is not easily parallelisable. For my case, one potential solution to this problem is using a sliding window approach, where I use a sliding window of 1Mb around each gene, then launch a job for each window. Is this sliding window approach something that can be done in Hail? And if so, are there any recommendations on the best way to do this?
Interesting problem! Hail doesn’t have very good support for windowed operations currently, but it’s definitely on our roadmap. It might be possible to get something working in a hacky way, but it would likely be tricky.
There’s another approach that I think should work. We have an aggregator called
approx_cdf (docs) that computes a small sketch of the distribution of field. From that sketch, you can estimate the median or other quantiles of the field, and you can go the other way, estimating the rank of a value. It doesn’t look like we ever added a function for going from value to rank, but I can write that and share it with you.
So in this approach you would first compute the approximate cdfs of the two fields in an aggregation pass. Then in a second pass, you would use those to estimate the ranks of each value, with which you could compute the rank correlation. Do you think that would work for your use case?
Could you elaborate a bit on what you want to do? Are the gene expression and genotype data entry fields? And if so, are you computing a single global correlation, or per row or column?
Hi Patrick, that’s a super helpful response! At the moment the gene expression and genotype fields are just stored in a table, and I’d like to compute the correlation on the columns of expression values and genotypes. However I ended up finding a way to do this in Hail batch by sending out a job for each gene with a 1Mb window around it. I think using
approx_cdf would be handy in the future if I’m trying to find the rank of values, and I’ll be on the lookout for any new implementations of windowed operations
Glad you found a way to do what you needed!
For future reference, here’s how I would do this using
ht is a Hail table with fields
bar, and we want to compute the rank correlation.
# First compute the cdfs of the two fields.
# `_localize=False` just prevents sending the result to Python and back.
# The `k` parameter controls the accuracy/sketch size tradeoff);
# I've increased it from the default 100 to 1000 because we're only computing two,
# so we can afford larger sketches.
cdfs = ht.aggregate(
# Store the results in global fields
ht = ht.annotate_globals(**agg)
# Use the cdfs to compute approximate ranks
ht = ht.annotate(
Now you can use the ranks to compute correlation.
The last step uses a function
rank_from_cdf, which I will PR to include in Hail, but in the meantime here is the implementation:
def rank_from_cdf(cdf, value):
"""Estimates rank of a value, given the results of an :func:`.approx_cdf` aggregator
cdf : :class:`.StructExpression`
Result of :func:`.approx_cdf` aggregator
value : :class:`.Expression`
rank = _lower_bound(cdf.values, value)
return hl.rbind(cdf, compute)
This is fantastic- thanks Patrick! I just tested it out on the table I referenced above and was able to get the ranks using the code you provided (after loading the dependencies for the
rank_from_cdf function from here). This will be very useful for me in the future, and I’m sure for others. Thanks so much for taking the time to answer my question and for implementing the new function into Hail!