Calculate LD of one variant against all variants in a region

Hi!

We are trying to calculate the Pearson correlation / linkage disequilibrium between one (reference) variant and all variants in a genomic region. Both the reference variant and the region are from the same dataset.
Our strategy so far is to

  1. subset the full dataset to the reference variant,
  2. subset the full dataset to the region of interest,
  3. use union_rows() to merge above subsets,
  4. use hl.ld_matrix() to calculate the LD matrix on the merged dataset (based on GT.n_alt_alleles()), setting the LD radius to cover the maximum distance between the reference locus and the “farthest” end of the region (I also tried the equivalent hl.row_correlation(), with similarly slow results),
  5. from the resulting block matrix, select the relevant Pearson correlation values by accessing [0, 1:],
  6. and using to_numpy() to collect the values.

Now this kind of works, but it is really, really slow - I imagine because the full (N + 1) x (N + 1) matrix (N being the number of variants in the region) is being calculated.
I also tried to use BlockMatrix.sparsify_rectangles() to zero out everything but the first row, but it still seems to me as if Hail isn’t lazy enough to only calculate the required / collected LD values.

What would be a better and faster way to go about this? I tried something along the lines of Iterate over rows - #4 by orr, but am stuck, I think, at annotating entries in the region subset with the GT.n_alt_alleles() of the one-variant subset - unlike in the linked example, the “shapes” and keys don’t match.

Any help would be greatly appreciated! I’m happy to provide any additional information that might be useful for trying to help us out.

How about something like:


def mean_center_and_variance_normalize_call(m):
    m = m.annotate_rows(cs=hl.agg.stats(m.GT.n_alt_alleles()))
    return m.annotate_entries(x=(m.GT.n_alt_alleles() - m.cs.mean) / m.cs.stdev)

mt_1var = mt.filter_rows(mt.locus == THE_ONE_REFERENCE_LOCUS)
mt_1var = mean_center_and_variance_normalize_call(mt_1var)
mt_1var = mt_1var.key_rows_by().select_rows().select_cols().entries().key_by('s')

mt_region = hl.filter_intervals(mt, [REGION_YOU_WANT])
mt_region = mt_region.annotate_cols(reference_x = mt_1var[mt_region.s].x)
mt_region = mean_center_and_variance_normalize_call(mt_region)
mt_region = mt_region.annotate_rows(hl.agg.corr(mt_region.x, mt_region.reference_x))
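One sanity check worth noting: the per-row standardization in mean_center_and_variance_normalize_call doesn’t change the resulting Pearson correlation, so it can’t introduce discrepancies by itself. A quick plain-NumPy check on made-up genotype dosages (nothing from your real dataset):

```python
# Standardizing each vector (subtract mean, divide by stdev) leaves the
# Pearson correlation unchanged; quick check on made-up dosage vectors.
import numpy as np

g1 = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
g2 = np.array([0.0, 0.0, 2.0, 2.0, 1.0])

def standardize(g):
    return (g - g.mean()) / g.std()

r_raw = np.corrcoef(g1, g2)[0, 1]
r_std = np.corrcoef(standardize(g1), standardize(g2))[0, 1]
# r_raw and r_std agree to floating-point precision
```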

Thanks a lot, @tpoterba! That works like a charm :slightly_smiling_face: Never would I have figured that out myself.
I replaced the last line,
mt_region = mt_region.annotate_rows(hl.agg.corr(mt_region.x, mt_region.reference_x))
with
mt_region = mt_region.annotate_rows(pearson_correlation=hl.agg.corr(mt_region.x, mt_region.reference_x)),
though - I would think the “field name” is necessary here.

yep, always forget something when I’m writing code in a text box instead of an IDE :slight_smile:

@tpoterba this has been working really great for us, but now we are hitting issues with datasets that have filtered out GT entries (they show as NA in mt.GT.show()). It seems like what you propose sometimes yields a NaN correlation in that case, while, e.g., hl.row_correlation() yields a non-NaN float value. I haven’t quite figured out yet the exact conditions that have to apply for a NaN value to occur, but I will continue to investigate. In the meantime, I was wondering whether you have any idea what might lead to that behavior? I would have guessed that, as per the documentation, any filtered-out entries are ignored for subsequent analyses.

I created a minimal example of this behavior. To reproduce, one would have to download this tiny matrix table with two rows only, adapt the path in the following script and run it:

import hail as hl

mt = hl.read_matrix_table("/home/simeon/Downloads/ld_nan_test.mt")

# shows a non-zero LD value
hl.ld_matrix(mt.GT.n_alt_alleles(), mt.locus, 1e6).to_numpy()

def mean_center_and_variance_normalize_call(m):
    m = m.annotate_rows(cs=hl.agg.stats(m.GT.n_alt_alleles()))
    return m.annotate_entries(x=(m.GT.n_alt_alleles() - m.cs.mean) / m.cs.stdev)

mt_region = mt    
mt_1var = mt.filter_rows(mt.locus.position == 10382)
mt_1var = mean_center_and_variance_normalize_call(mt_1var)
mt_1var = mt_1var.key_rows_by().select_rows().select_cols().entries().key_by('s')
mt_region = mt_region.annotate_cols(reference_x = mt_1var[mt_region.s].x)
mt_region = mean_center_and_variance_normalize_call(mt_region)
mt_region = mt_region.annotate_rows(r=hl.agg.corr(mt_region.x, mt_region.reference_x))

# shows NaN LD value
mt_region.r.show()

I am at a loss as to why this occurs; I would have expected only non-filtered entries to be taken into account.

Silly question but just to confirm: there are absolutely no NaNs in the source rows that produce a NaN correlation?

If there are no non-missing pairs per row, I’d expect a NaN, I think.

add to the aggregation:

n_defined = hl.agg.count_where(hl.is_defined(mt_region.x) & hl.is_defined(mt_region.reference_x))
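What that count is probing, i.e. how many “pairwise complete” observations each row actually has, can be illustrated in plain NumPy with made-up vectors (this is the textbook pairwise-deletion convention, not necessarily exactly what hl.agg.corr does internally):

```python
# Pearson r is only well-defined over pairs where BOTH values are present.
# Toy vectors with one missing value each (NaN standing in for missing):
import numpy as np

x = np.array([0.0, 1.0, 2.0, np.nan])
y = np.array([np.nan, 1.0, 0.0, 2.0])

# keep only "pairwise complete" observations
mask = ~np.isnan(x) & ~np.isnan(y)
n_defined = int(mask.sum())          # analogous to the n_defined aggregation
r = np.corrcoef(x[mask], y[mask])[0, 1]
```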

Thanks for coming back to me so quickly! It seems like there are indeed no NaNs in the source rows, as evidenced by

In [1]: set(hl.is_nan(mt_region.x).collect())
Out[1]: {False}

In [2]: set(hl.is_nan(mt_region.reference_x).collect())
Out[2]: {False, None}

I am not sure what None means for hl.is_nan(), though. Does it mean “not defined”? Because here’s the result of @tpoterba’s aggregation:

In [3]: mt_region.annotate_rows(n_defined=hl.agg.count_where(hl.is_defined(mt_region.x) & hl.is_defined(mt_region.reference_x))).n_defined.show()
+---------------+------------+-----------+
| locus         | alleles    | n_defined |
+---------------+------------+-----------+
| locus<GRCh38> | array<str> |     int64 |
+---------------+------------+-----------+
| chr10:10382   | ["A","T"]  |       298 |
| chr10:92475   | ["A","G"]  |       298 |
+---------------+------------+-----------+

In [4]: mt_region.count_cols()
Out[4]: 755

So some of the values we work with are indeed missing (which escaped my attention because hl.is_missing(mt_region.GT).collect() is all False). While they should get ignored downstream (right?), it’s odd that there are missing values for x / reference_x but not for GT :thinking: During standardization, there are no zero standard deviations that might lead to NaN / undefined values. I’ll investigate further on Friday.

None means that there was at least one mt_region.reference_x which was missing (hl.is_missing(...) was true). Missing, NaN, and filtered are three distinct concepts in Hail. NaN is only relevant to floating point numbers and should only appear when you divide by zero or perform other ill-defined operations.

For most operations, missing or NA propagates through them as it does in SQL.

For completeness:

import hail as hl
import numpy as np

assert hl.eval(hl.is_nan(3.0)) == False
assert hl.eval(hl.is_nan(np.nan)) == True
assert hl.eval(hl.is_nan(hl.missing(hl.tfloat))) == None

assert hl.eval(hl.is_missing(3.0)) == False
assert hl.eval(hl.is_missing(np.nan)) == False
assert hl.eval(hl.is_missing(hl.missing(hl.tfloat))) == True

assert hl.eval(hl.missing(hl.tfloat) + 3.0) == None

This little example is also instructive, I think:

mt = hl.utils.range_matrix_table(2, 2)
mt = mt.annotate_entries(x = hl.case()
    .when(hl.all(mt.row_idx == 0, mt.col_idx == 0), 0.0)
    .when(hl.all(mt.row_idx == 1, mt.col_idx == 0), np.nan)
    .when(hl.all(mt.row_idx == 0, mt.col_idx == 1), hl.missing(hl.tfloat))
    .default(1.0)
)
mt.show()
+---------+----------+----------+
| row_idx |      0.x |      1.x |
+---------+----------+----------+
|   int32 |  float64 |  float64 |
+---------+----------+----------+
|       0 | 0.00e+00 |       NA |
|       1 |      NaN | 1.00e+00 |
+---------+----------+----------+

We can produce filtered entries with filter_entries. NOTE: when the condition evaluates to missing, that is treated as False, i.e. the entry is removed. Also notice that we, somewhat confusingly, display filtered entries as NAs:

mt = mt.filter_entries(mt.x != 1.0)
mt.show()
+---------+----------+---------+
| row_idx |      0.x |     1.x |
+---------+----------+---------+
|   int32 |  float64 | float64 |
+---------+----------+---------+
|       0 | 0.00e+00 |      NA |
|       1 |      NaN |      NA |
+---------+----------+---------+

Notice that filtered entries are excluded from the output:

mt.x.collect()
[0.0, nan]

If you wanted to preserve missing values rather than convert them to filtered, do this:

mt = hl.utils.range_matrix_table(2, 2)
mt = mt.annotate_entries(x = hl.case()
    .when(hl.all(mt.row_idx == 0, mt.col_idx == 0), 0.0)
    .when(hl.all(mt.row_idx == 1, mt.col_idx == 0), np.nan)
    .when(hl.all(mt.row_idx == 0, mt.col_idx == 1), hl.missing(hl.tfloat))
    .default(1.0)
)
mt = mt.filter_entries(hl.coalesce(mt.x != 1.0, True))
mt.show()
+---------+----------+---------+
| row_idx |      0.x |     1.x |
+---------+----------+---------+
|   int32 |  float64 | float64 |
+---------+----------+---------+
|       0 | 0.00e+00 |      NA |
|       1 |      NaN |      NA |
+---------+----------+---------+

It doesn’t look different, but it is different:

mt.x.collect()
[0.0, None, nan]

Thanks a lot @danking for that excellent explanation!

Unfortunately, even with increased understanding of missingness, NaN, and filtered, I haven’t progressed much further.
I managed to filter down above toy dataset such that, for both mt_region.x and mt_region.reference_x,

  • set(hl.is_missing(mt_region.{x, reference_x}).collect()) == {False},
  • set(hl.is_defined(mt_region.{x, reference_x}).collect()) == {True},
  • set(hl.is_nan(mt_region.{x, reference_x}).collect()) == {False}.

So I would say that nothing suspicious is left anymore, but I still get a NaN LD value. Do any of you have more ideas about what could be going on here?

I’m fairly certain you have filtered entries:

In [9]: mt = hl.utils.range_matrix_table(2, 2)
   ...: mt = mt.annotate_entries(x = mt.col_idx)
   ...: mt.show()
   ...: mt = mt.filter_entries(hl.all(mt.row_idx == 1, mt.col_idx == 1), keep=False)
   ...: mt.show()
   ...: mt = mt.annotate_rows(x_corr = hl.agg.corr(mt.x, mt.col_idx))
   ...: mt.x_corr.show()
+---------+-------+-------+
| row_idx |   0.x |   1.x |
+---------+-------+-------+
|   int32 | int32 | int32 |
+---------+-------+-------+
|       0 |     0 |     1 |
|       1 |     0 |     1 |
+---------+-------+-------+
+---------+-------+-------+
| row_idx |   0.x |   1.x |
+---------+-------+-------+
|   int32 | int32 | int32 |
+---------+-------+-------+
|       0 |     0 |     1 |
|       1 |     0 |    NA |
+---------+-------+-------+
+---------+----------+
| row_idx |   x_corr |
+---------+----------+
|   int32 |  float64 |
+---------+----------+
|       0 | 1.00e+00 |
|       1 |      NaN |
+---------+----------+

You can get entry filter stats like this:

mt.compute_entry_filter_stats().entry_stats_row.show()

I’m not sure what the right thing for Hail to do in this case is. What should happen if you try to correlate two vectors but one of the vectors has a different length from the other? That’s what happens above: [0] is compared to [0, 1].
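One way to read the NaN, assuming a filtered entry simply drops the whole pair from the aggregation, is that row 1 is left with a single defined pair, and a one-point Pearson r is 0/0. In plain NumPy terms (toy values):

```python
# After filtering, row 1 effectively contributes one (x, col_idx) pair.
# A single observation has zero variance, so Pearson r is 0/0 = NaN.
import numpy as np
import warnings

x = np.array([0.0])           # the one surviving x value in row 1
y = np.array([0.0])           # the matching col_idx value
with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # np.corrcoef warns on zero variance
    r = np.corrcoef(x, y)[0, 1]
# r is NaN
```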

Now, I think you should be able to use unfilter_entries to fix this, but I think we might have a bug in unfilter_entries because this doesn’t work as I expect.

In [24]: mt = hl.utils.range_matrix_table(2, 2)
    ...: mt = mt.annotate_entries(x = mt.col_idx)
    ...: mt = mt.filter_entries(hl.all(mt.row_idx == 1, mt.col_idx == 1), keep=False)
    ...: mt = mt.unfilter_entries()
    ...: mt = mt.annotate_rows(mean_x = hl.agg.mean(mt.x))
    ...: mt = mt.annotate_entries(x = hl.coalesce(mt.x, mt.mean_x))
    ...: mt = mt.annotate_rows(x_corr = hl.agg.corr(mt.x, mt.col_idx))
    ...: mt.x.show()
    ...: mt.x_corr.show()
+---------+----------+----------+
| row_idx |      0.x |      1.x |
+---------+----------+----------+
|   int32 |  float64 |  float64 |
+---------+----------+----------+
|       0 | 0.00e+00 | 1.00e+00 |
|       1 | 0.00e+00 | 0.00e+00 |
+---------+----------+----------+
+---------+----------+
| row_idx |   x_corr |
+---------+----------+
|   int32 |  float64 |
+---------+----------+
|       0 | 1.00e+00 |
|       1 |      NaN |
+---------+----------+

unfilter_entries() is meant to convert filtered entries into missing entries.

Ah, wait, no, the Pearson correlation coefficient is undefined in my example above: the variance in the x’s is zero: they’re all zero. If I replace the mean with 1.0, it all works fine.
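For reference, the zero-variance failure mode in plain NumPy: a constant vector has stdev 0, so the Pearson denominator is 0 and r comes out NaN.

```python
import numpy as np
import warnings

x = np.array([0.0, 0.0, 0.0])   # constant vector => variance 0
y = np.array([0.0, 1.0, 2.0])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # np.corrcoef warns when a variance is 0
    r = np.corrcoef(x, y)[0, 1]
# r is NaN: the denominator contains stdev(x) == 0
```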

@simeoncarstens do you perhaps have monomorphic rows?

Indeed, I have filtered-out entries! Good catch :slightly_smiling_face:
While mt.unfilter_entries() might indeed make it work, I’d prefer not to do that; I guess that way I would be reversing some of the QC that was performed on the data before.

As for monomorphic rows, no, I don’t think that’s the case:

In [1]: mt_region.cs.stdev.collect()
Out[1]: [0.057831167582597605, 0.03639369425467438]

In [2]: set(mt_region.filter_rows(mt_region.locus.position == 92475).GT.collect())
Out[2]: {Call(alleles=[0, 0], phased=False), Call(alleles=[0, 1], phased=False)}

In [3]: set(mt_region.filter_rows(mt_region.locus.position == 10382).GT.collect())
Out[3]: {Call(alleles=[0, 0], phased=False), Call(alleles=[0, 1], phased=False)}

But you are definitely right that one would expect differently filtered-out entries in the two rows to pose a problem. Is there a way to ask Hail to take into account only those entries that “match up”, i.e. that are not filtered out in either row, when calculating the correlation?
The kind of reverse question would then be: what does hl.row_correlation() do in that case so that it yields a non-NaN result? Is this result trustworthy?
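To make the comparison concrete, here is what two common conventions for correlating vectors with missing values give on the same toy data, in plain NumPy. Which convention hl.row_correlation actually applies isn’t settled in this thread; this is just a reference for checking the numbers against:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, np.nan])
y = np.array([1.0, np.nan, 0.0, 2.0])

# 1) pairwise deletion: use only positions defined in both vectors
mask = ~np.isnan(x) & ~np.isnan(y)
r_pairwise = np.corrcoef(x[mask], y[mask])[0, 1]

# 2) mean imputation: replace each vector's missing values with its own mean
xi = np.where(np.isnan(x), np.nanmean(x), x)
yi = np.where(np.isnan(y), np.nanmean(y), y)
r_imputed = np.corrcoef(xi, yi)[0, 1]
# the two conventions can give very different answers on the same data
```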