Sliding window approach to Spearman's rank

KatalinaBobowik · October 14, 2021, 3:31am

Hi,

I’d like to run a Spearman’s rank correlation in Hail comparing gene expression information with genotype information. I know this is quite tricky given that the Spearman’s correlation needs to rank the data, and therefore is not easily parallelisable. For my case, one potential solution to this problem is using a sliding window approach, where I use a sliding window of 1Mb around each gene, then launch a job for each window. Is this sliding window approach something that can be done in Hail? And if so, are there any recommendations on the best way to do this?

Thanks!

patrick-schultz · October 14, 2021, 4:05pm

Hi Katalina,

Interesting problem! Hail doesn’t have very good support for windowed operations currently, but it’s definitely on our roadmap. It might be possible to get something working in a hacky way, but it would likely be tricky.

There’s another approach that I think should work. We have an aggregator called approx_cdf (docs) that computes a small sketch of the distribution of field. From that sketch, you can estimate the median or other quantiles of the field, and you can go the other way, estimating the rank of a value. It doesn’t look like we ever added a function for going from value to rank, but I can write that and share it with you.

So in this approach you would first compute the approximate cdfs of the two fields in an aggregation pass. Then in a second pass, you would use those to estimate the ranks of each value, with which you could compute the rank correlation. Do you think that would work for your use case?

patrick-schultz · October 14, 2021, 5:12pm

Could you elaborate a bit on what you want to do? Are the gene expression and genotype data entry fields? And if so, are you computing a single global correlation, or per row or column?

KatalinaBobowik · October 20, 2021, 5:56am

Hi Patrick, that’s a super helpful response! At the moment the gene expression and genotype fields are just stored in a table, and I’d like to compute the correlation on the columns of expression values and genotypes. However I ended up finding a way to do this in Hail batch by sending out a job for each gene with a 1Mb window around it. I think using approx_cdf would be handy in the future if I’m trying to find the rank of values, and I’ll be on the lookout for any new implementations of windowed operations

patrick-schultz · October 20, 2021, 1:32pm

Glad you found a way to do what you needed!

For future reference, here’s how I would do this using approx_cdf. Say ht is a Hail table with fields foo and bar, and we want to compute the rank correlation.

# First compute the cdfs of the two fields.
#  `_localize=False` just prevents sending the result to Python and back.
# The `k` parameter controls the accuracy/sketch size tradeoff);
# I've increased it from the default 100 to 1000 because we're only computing two,
# so we can afford larger sketches.
cdfs = ht.aggregate(
    hl.struct(
        foo_cdf=hl.agg.approx_cdf(ht.foo, k=1000),
        bar_cdf=hl.agg.approx_cdf(ht.bar, k=1000)),
    _localize=False)
# Store the results in global fields
ht = ht.annotate_globals(**agg)
# Use the cdfs to compute approximate ranks
ht = ht.annotate(
    foo_rank=rank_from_cdf(ht.foo_cdf, ht.foo),
    bar_rank=rank_from_cdf(ht.bar_cdf, ht.bar))

Now you can use the ranks to compute correlation.

The last step uses a function rank_from_cdf, which I will PR to include in Hail, but in the meantime here is the implementation:

def rank_from_cdf(cdf, value):
    """Estimates rank of a value, given the results of an :func:`.approx_cdf` aggregator

    Parameters
    ----------
    cdf : :class:`.StructExpression`
        Result of :func:`.approx_cdf` aggregator
    value : :class:`.Expression`

    Returns
    -------
    :class:`.NumericExpression`
        Estimated rank
    """
    def compute(cdf):
        rank = _lower_bound(cdf.values, value)
        return cdf.ranks[rank]
    return hl.rbind(cdf, compute)

KatalinaBobowik · October 20, 2021, 11:57pm

This is fantastic- thanks Patrick! I just tested it out on the table I referenced above and was able to get the ranks using the code you provided (after loading the dependencies for the rank_from_cdf function from here). This will be very useful for me in the future, and I’m sure for others. Thanks so much for taking the time to answer my question and for implementing the new function into Hail!

Topic		Replies	Views
Hail suitability for trans-eQTL Hail Query & hailctl	1	244	May 1, 2023
Visualization and analytics frontend Hail Query & hailctl	2	649	June 6, 2020
Is hail a good option for simple querying tasks on a large dataset (using as a "db")? Hail Query & hailctl	4	355	May 15, 2023
Sparsifying LD matrix with arbitrary windows Hail Query & hailctl	11	352	January 12, 2023
[Experimental] Population Aware Relatedness Estimation Updates	0	1521	August 12, 2017

Sliding window approach to Spearman's rank

Related topics