Hello Hail community!
I am trying to find and annotate variants that are 2 base pairs away from each other and was wondering if anyone in the community has tried to do something similar.
Using the locus_windows function, I can identify variants that overlap in windows with a 2 bp radius:
ht_sites = mt.rows()
window_res = hl.linalg.utils.locus_windows(ht_sites.locus, 2)
This returns an array that looks like this:
array([0, 0, 0, 3, 4, 5, 5, 7, 8, ...]) where repeated elements represent the indices of variant sites that are within 2 bp of another. I could identify the elements with repeated values, transform that into a boolean expression, and use it to annotate the rows of the MatrixTable.
However, there are two issues with this approach:
- A locus with two or more non-ref alleles that is not within 2 bp of another variant will be wrongly annotated, because its rows share the same locus and therefore also show up as repeated values.
- locus_windows returns a numpy array, and I was wondering if there is a way to keep everything in Hail structures to make full use of the distributed computing.
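To make the first issue concrete, here is a minimal sketch in plain Python (not Hail, and not distributed) of the logic I am after. It assumes the rows are available as a list of (contig, position) tuples sorted in that order, with one entry per row, so a multi-allelic locus appears more than once. The function name flag_nearby_variants and the example loci are made up for illustration. A row is flagged only if a different locus on the same contig lies within the radius, so rows that merely share a locus are not flagged.

```python
from bisect import bisect_left, bisect_right


def flag_nearby_variants(loci, radius=2):
    """Flag rows that have a *distinct* locus within `radius` bp.

    `loci` is assumed to be a list of (contig, position) tuples sorted by
    contig and then position, one entry per row (hypothetical input format).
    """
    flags = []
    for contig, pos in loci:
        # Half-open index range [lo, hi) of rows on the same contig whose
        # position falls in [pos - radius, pos + radius].
        lo = bisect_left(loci, (contig, pos - radius))
        hi = bisect_right(loci, (contig, pos + radius))
        # The window always contains the row itself and any rows sharing
        # its locus, so only count rows whose locus differs.
        flags.append(any(loci[i] != (contig, pos) for i in range(lo, hi)))
    return flags


# Made-up example: ("1", 105) appears twice, as a multi-allelic site would.
loci = [
    ("1", 100), ("1", 101),
    ("1", 105), ("1", 105),
    ("1", 110), ("1", 112),
    ("2", 111),
]
flags = flag_nearby_variants(loci)
print(flags)
```

With this logic the two rows at ("1", 105) are not flagged even though they would produce repeated window indices, and ("2", 111) is not flagged because its only neighbours are on another contig. What I am looking for is a way to express the same thing with Hail expressions rather than a local list.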
Thanks in advance for any insight!