Annotating variants that are 2 nucleotides apart from each other

Hello Hail community!

I am trying to find and annotate variants that lie within 2 base pairs of each other, and was wondering if anyone in the community has tried something similar.

Using the locus_windows function, I can identify variants that overlap in windows with a 2bp radius:

import hail as hl
# mt is an existing MatrixTable; [0] keeps the window start indices
ht_sites = mt.rows()
window_res = hl.linalg.utils.locus_windows(ht_sites.locus, 2)[0]

This returns a NumPy array of window start indices that looks like this: array([0, 0, 0, 3, 4, 5, 5, 7, 8, ...]), where repeated values mark variant sites that are within 2 bp of one another. I could identify the elements with repeated values, turn that into a boolean expression, and use it to annotate the rows of the MatrixTable.
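
A minimal sketch of that flagging step, in plain Python so it runs without Hail. It assumes that locus_windows also returns a second array of stop indices and that window i covers rows starts[i] up to, but not including, stops[i]; under that assumption a row has a neighbour exactly when its window holds more than one row. The lists stand in for the NumPy arrays, the stops values are invented to match the starts quoted above, and flag_rows_with_neighbours is a hypothetical helper name.

```python
def flag_rows_with_neighbours(starts, stops):
    """Return one boolean per row: True if its window holds another row.

    Assumes window i covers rows starts[i] .. stops[i] - 1 (stop exclusive),
    so a window of size 1 contains only the row itself.
    """
    return [stop - start > 1 for start, stop in zip(starts, stops)]


# Hypothetical values: `starts` copies the array quoted above,
# `stops` is invented to be consistent with it.
starts = [0, 0, 0, 3, 4, 5, 5, 7, 8]
stops = [3, 3, 3, 4, 5, 7, 7, 8, 9]
flags = flag_rows_with_neighbours(starts, stops)
print(flags)
```

Comparing window sizes rather than looking for repeated start values also catches a row that is close to the previous row without sharing its start index, for example positions 1, 3 and 5 with a radius of 2.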

However, there are two issues with this approach:

  1. A locus with two or more non-ref alleles that is not within 2 bp of another variant will be wrongly annotated, since its alleles sit on separate rows that share the same locus.
  2. locus_windows returns a NumPy array, and I was wondering whether there is a way to keep everything in Hail structures so the computation stays distributed.
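
To make the first issue concrete, here is a small plain-Python sketch of the logic I have in mind, not Hail code. It collapses the rows to distinct loci first and then asks, for each row, whether a different locus exists at any offset from -2 to +2 on the same contig. The function name near_other_locus and the example rows are hypothetical. Each check is a lookup by (contig, position), so it might translate into a join against a table keyed by locus, but that mapping is an assumption and is not shown here.

```python
def near_other_locus(loci, radius=2):
    """Flag each row whose locus has a *different* locus within `radius` bp.

    `loci` is a list of (contig, position) tuples, one per row; rows of a
    split multiallelic site repeat the same tuple.
    """
    distinct = set(loci)  # one entry per locus, however many alleles it has
    offsets = [d for d in range(-radius, radius + 1) if d != 0]
    return [
        any((contig, pos + d) in distinct for d in offsets)
        for contig, pos in loci
    ]


# Hypothetical rows: 1:100 appears twice (two alt alleles) but is isolated,
# 1:200 and 1:202 are 2 bp apart, and 2:201 is on another contig.
rows = [("1", 100), ("1", 100), ("1", 200), ("1", 202), ("2", 201)]
flags = near_other_locus(rows)
print(flags)
```

Because offset 0 is excluded, the two rows at 1:100 do not count as neighbours of each other, which is the behaviour the window-based approach gets wrong.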

Thanks in advance for any insight!
Roberto

@patrick-schultz might have an idea about this?
