We are trying to calculate the Pearson correlation / linkage disequilibrium between one (reference) variant and all variants in a genomic region. Both the reference variant and the region are from the same dataset.
Our strategy so far is to

- subset the full dataset to the reference variant,
- subset the full dataset to the region of interest,
- `union_rows()` to merge the above subsets,
- `hl.ld_matrix()` to calculate the LD matrix on the merged dataset (based on `GT.n_alt_alleles()`), setting the LD radius to cover the maximum distance between the reference locus and the “farthest” end of the region (I also tried `hl.row_correlation()`, with similarly slow results),
- from the resulting block matrix, select the relevant Pearson correlation values by accessing the first row, and
- use `to_numpy()` to collect the values.
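For intuition, the quantity we are after is just the Pearson correlation of the reference variant's dosage vector (`GT.n_alt_alleles()` per sample) against each row of the region's dosage matrix — N correlations, not a full matrix. A minimal pure-Python sketch of that target computation (the dosage data and names below are toy placeholders, not our real dataset):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Assumes neither vector is constant (monomorphic variants would
    give a zero denominator).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Toy genotype dosages (n_alt_alleles per sample).
ref_gt = [0, 1, 2, 1, 0, 2]   # the one reference variant
region = [
    [0, 1, 2, 1, 0, 2],       # identical to the reference -> r = 1.0
    [2, 1, 0, 1, 2, 0],       # perfectly anti-correlated  -> r = -1.0
    [0, 0, 1, 2, 1, 2],
]

# One pass over the region: O(N * samples) work in total.
r = [pearson(ref_gt, row) for row in region]
```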
Now this kind of works, but it is really, really slow — I imagine because the full `(N + 1) x (N + 1)` matrix is being calculated (`N` being the number of variants in the region).
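Only the first row of that matrix, N + 1 values, is actually needed, so the full computation does on the order of N times more work than necessary. A quick sanity check of the ratio (the N here is purely illustrative):

```python
N = 100_000                   # variants in the region (illustrative)
full_entries = (N + 1) ** 2   # entries in the full (N + 1) x (N + 1) LD matrix
needed = N + 1                # reference-vs-everything correlations only
overhead = full_entries // needed  # equals N + 1: ~N times too much work
```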
I also tried `BlockMatrix.sparsify_rectangle()` to set everything but the first row to zero, but Hail still does not seem to be lazy enough to compute only the LD values that are actually collected.
What would be a better and faster way to go about this? I tried something along the lines of Iterate over rows - #4 by orr, but am stuck, I think, at annotating entries in the region subset with the `GT.n_alt_alleles()` of the one-variant subset: unlike in the linked example, the “shapes” and keys don't match.
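One way around the shape mismatch, sketched here with plain dicts rather than real Hail tables, is to re-key the one-variant subset by sample only, so its dosages become a per-sample annotation that every row of the region can join against; a per-row Pearson correlation then falls out of a single aggregation. All data and names below are toy placeholders:

```python
import math
from collections import defaultdict

# Toy "entries" of the region subset: (variant, sample) -> n_alt_alleles.
region_entries = {
    ("v1", "s1"): 0, ("v1", "s2"): 1, ("v1", "s3"): 2, ("v1", "s4"): 1,
    ("v2", "s1"): 2, ("v2", "s2"): 1, ("v2", "s3"): 0, ("v2", "s4"): 1,
}

# The one-variant subset, re-keyed by sample only: the reference variant's
# dosages become a per-sample lookup, so keys now match on the sample axis.
ref_by_sample = {"s1": 0, "s2": 1, "s3": 2, "s4": 1}

# "Annotate" each region entry with the reference dosage of its sample,
# grouping per variant (row).
pairs = defaultdict(list)
for (variant, sample), dosage in region_entries.items():
    pairs[variant].append((dosage, ref_by_sample[sample]))

def pearson(xy):
    """Pearson correlation over (x, y) pairs; assumes non-constant vectors."""
    n = len(xy)
    mx = sum(x for x, _ in xy) / n
    my = sum(y for _, y in xy) / n
    cov = sum((x - mx) * (y - my) for x, y in xy)
    vx = sum((x - mx) ** 2 for x, _ in xy)
    vy = sum((y - my) ** 2 for _, y in xy)
    return cov / math.sqrt(vx * vy)

# One aggregation per row gives the LD values directly, no (N+1)^2 matrix.
ld = {variant: pearson(p) for variant, p in pairs.items()}
```

Here `v1` is identical to the reference (r = 1.0) and `v2` is its mirror image (r = -1.0); in Hail terms the final step would correspond to a per-row aggregation over entries rather than a block-matrix product.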
Any help would be greatly appreciated! I’m happy to provide any additional information that might be useful for trying to help us out.