I am trying to find and annotate variants that are 2 base pairs away from each other and was wondering if anyone in the community has tried to do something similar.
Using the locus_windows function, I can identify variants that overlap in windows with a 2bp radius:
This returns an array that looks like this: array([0, 0, 0, 3, 4, 5, 5, 7, 8, ...]), where repeated elements mark variant sites that are within two base pairs of another site. I could identify the elements with repeated values, convert that into a boolean array, and use it to annotate the rows of the MatrixTable.
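To make the "repeated values to boolean" step concrete, here is a minimal sketch. It assumes that locus_windows returns two arrays, starts and stops, and that the window for row i is the half-open index range [starts[i], stops[i]); the helper name has_neighbour is hypothetical. A row then has at least one other row inside its window whenever the window holds more than one index.

```python
def has_neighbour(starts, stops):
    """Return one boolean per row: True if the row's window contains
    at least one row other than itself.

    Assumes half-open windows [start, stop), so a window that holds
    only the row itself has stop - start == 1.
    """
    # int(...) keeps this working for NumPy integers as well as plain ints.
    return [int(stop) - int(start) > 1 for start, stop in zip(starts, stops)]


# Illustrative values only, not output from a real dataset.
starts = [0, 0, 2, 3, 3, 5]
stops = [2, 2, 3, 5, 5, 6]
flags = has_neighbour(starts, stops)
print(flags)  # [True, True, False, True, True, False]
```

Note that this counts any other row in the window, including a second row at the same locus.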
However, there are two issues with this approach:
A locus with two or more non-ref alleles that is not within 2 bp of another variant will be wrongly annotated, since rows that share a locus fall inside each other's windows.
locus_windows returns a NumPy array, and I was wondering whether there is a way to keep everything in Hail structures to make full use of distributed computing.
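To pin down the logic I am after for the first issue, here is a plain-Python sketch (not Hail code, and not distributed). It assumes each row is a (contig, position) pair and that a multi-allelic site may appear as several rows with the same pair; the function name near_distinct_locus is hypothetical. A row is flagged only if a row at a different position on the same contig lies within the radius.

```python
from collections import defaultdict


def near_distinct_locus(loci, radius=2):
    """Flag each (contig, position) row that lies within `radius` bp of a
    row at a *different* position on the same contig."""
    # Collect distinct positions per contig so duplicate rows collapse.
    positions = defaultdict(set)
    for contig, pos in loci:
        positions[contig].add(pos)

    flagged = set()
    for contig, pos_set in positions.items():
        ordered = sorted(pos_set)
        # Once sorted, only adjacent distinct positions need comparing.
        for left, right in zip(ordered, ordered[1:]):
            if right - left <= radius:
                flagged.add((contig, left))
                flagged.add((contig, right))

    # Map the per-locus result back onto every input row.
    return [(contig, pos) in flagged for contig, pos in loci]


# Illustrative rows: position 100 appears twice but has no close neighbour.
rows = [("1", 100), ("1", 100), ("1", 105), ("1", 107), ("2", 106)]
print(near_distinct_locus(rows))  # [False, False, True, True, False]
```

The duplicated locus at position 100 is not flagged, while 105 and 107 are, which is the behaviour I would like to reproduce on the MatrixTable rows.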