Calculate LD of one variant against all variants in a region

Hi!

We are trying to calculate the Pearson correlation / linkage disequilibrium between one (reference) variant and all variants in a genomic region. Both the reference variant and the region are from the same dataset.
Our strategy so far is to

  1. subset the full dataset to the reference variant,
  2. subset the full dataset to the region of interest,
  3. use union_rows() to merge above subsets,
  4. use hl.ld_matrix() to calculate the LD matrix on the merged dataset (based on GT.n_alt_alleles()), setting the LD radius to cover the maximum distance between the reference locus and the “farthest” end of the region (I also tried hl.row_correlation(), with similarly slow results),
  5. from the resulting block matrix, select the relevant Pearson correlation values by accessing [0, 1:],
  6. and use to_numpy() to collect the values (a minimal sketch of the whole pipeline follows this list).
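For concreteness, here is a rough sketch of that pipeline. The path, locus, interval, and radius are placeholders for our actual data, and it assumes the reference variant ends up as row 0 of the merged dataset:

import hail as hl

mt = hl.read_matrix_table('data.mt')  # placeholder path
ref_locus = hl.locus('1', 123456)  # placeholder reference locus
region = hl.parse_locus_interval('1:100000-200000')  # placeholder region

mt_ref = mt.filter_rows(mt.locus == ref_locus)  # step 1
mt_region = hl.filter_intervals(mt, [region])  # step 2
merged = mt_ref.union_rows(mt_region)  # step 3

# step 4: the radius must span from the reference locus
# to the farthest end of the region
ld = hl.ld_matrix(merged.GT.n_alt_alleles(), merged.locus, radius=1_000_000)

# steps 5 and 6: the first row holds r between the reference variant and all others
r = ld[0, 1:].to_numpy().flatten()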

Now this kind of works, but it is really, really slow - I imagine because the full (N + 1) x (N + 1) matrix (N being the number of variants in the region) is being computed.
I also tried BlockMatrix.sparsify_rectangles() to set everything but the first row to zero, but it still seems as if Hail isn’t lazy enough to compute only the LD values that are actually required / collected.
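That attempt looked roughly like this (a minimal sketch, with ld being the BlockMatrix from step 4 above):

# keep only blocks overlapping the first row; all other blocks become zero
n = ld.shape[0]
ld_sparse = ld.sparsify_rectangles([[0, 1, 0, n]])
r = ld_sparse[0, 1:].to_numpy().flatten()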

What would be a better and faster way to go about this? I tried something along the lines of Iterate over rows - #4 by orr, but am stuck, I think, at annotating entries in the region subset with the GT.n_alt_alleles() of the one-variant subset - unlike in the linked example, the “shapes” and keys don’t match.

Any help would be greatly appreciated! I’m happy to provide any additional information that might be useful for trying to help us out.

How about something like:


def mean_center_and_variance_normalize_call(m):
    # per-variant mean and standard deviation of the alternate-allele dosage
    m = m.annotate_rows(cs=hl.agg.stats(m.GT.n_alt_alleles()))
    # standardize each entry so that hl.agg.corr below yields Pearson correlation
    return m.annotate_entries(x=(m.GT.n_alt_alleles() - m.cs.mean) / m.cs.stdev)

mt_1var = mt.filter_rows(mt.locus == THE_ONE_REFERENCE_LOCUS)
mt_1var = mean_center_and_variance_normalize_call(mt_1var)
# flatten the single-variant matrix table to a table keyed by sample,
# so it can be joined onto the columns of the region subset
mt_1var = mt_1var.key_rows_by().select_rows().select_cols().entries().key_by('s')

mt_region = hl.filter_intervals(mt, [REGION_YOU_WANT])
# attach the reference variant's standardized dosage to each sample (column)
mt_region = mt_region.annotate_cols(reference_x=mt_1var[mt_region.s].x)
mt_region = mean_center_and_variance_normalize_call(mt_region)
mt_region = mt_region.annotate_rows(hl.agg.corr(mt_region.x, mt_region.reference_x))

Thanks a lot, @tpoterba! That works like a charm 🙂 I would never have figured that out myself.
I replaced the last line,
mt_region = mt_region.annotate_rows(hl.agg.corr(mt_region.x, mt_region.reference_x))
with
mt_region = mt_region.annotate_rows(pearson_correlation=hl.agg.corr(mt_region.x, mt_region.reference_x))
though - I would think the “field name” is necessary here.
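In case it helps others: with the named field, the per-variant correlations can then be collected from the row table, e.g. (a minimal sketch; to_pandas() is just one of several ways to collect):

# locus, alleles, and the Pearson correlation for every variant in the region
result = mt_region.rows().select('pearson_correlation').to_pandas()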

yep, always forget something when I’m writing code in a text box instead of an IDE 🙂