Annotating variants that are 2 nucleotides apart from each other

Hello Hail community!

I am trying to find and annotate variants that are 2 base pairs away from each other and was wondering if anyone in the community has tried to do something similar.

Using the locus_windows function, I can identify variants that overlap in windows with a 2bp radius:

ht_sites = mt.rows()
window_res = hl.linalg.utils.locus_windows(ht_sites.locus, 2)[0]

This returns an array that looks like this: array([0, 0, 0, 3, 4, 5, 5, 7, 8, ...]) where repeated elements represent the index of variant sites that are within two basepairs of another. I could identify elements with repeated values, transform that into a boolean expression and annotate the rows of the matrixTable.

However, there are two issues with this approach:

  1. A locus with two or more non-ref alleles and not within 2 bp from another variant will be wrongly annotated.
  2. locus_windows returns a numpy array and I was wondering if there is a way to keep everything in hail structures to make full use of the distributed computing.

Thanks in advance for any insight!
Roberto

@patrick-schultz might have an idea about this?

1 Like

Hi everyone, for anyone still looking for an answer, this is a belated update on a solution that @mhebrard came up with:

# Load genotypes
mt_g = hl.read_matrix_table('genotypes.mt')

# Keep SNPs only
mt_snp = mt_g.filter_rows(
    (hl.is_snp(mt_g.alleles[0], mt_g.alleles[1]))
)

# Keep Alt genotypes only
mt_alt = mt_snp.filter_entries(
    hl.is_defined(mt_snp.GT) & mt_snp.GT.is_non_ref()
)

# write alt SNP mt
mt_alt.write("genotypes.altSNP.mt")


# Read alt SNP mt
mt_alt = hl.read_matrix_table('genotypes.altSNP.mt')

# Read mt_key
mt_key = mt_alt

# Key by locus only
mt_key = mt_key.key_rows_by('locus')

# Annotate MNP +/- 2bp
# Note that we use mt_key for the lookup but annotate mt_alt
# It is required to work on 2 different tables for the predicate to work
mt_mnp = mt_alt.annotate_entries(
    gt_is_mnp = ( 
        hl.is_defined(mt_key[
            hl.locus(mt_alt.locus.contig, mt_alt.locus.position+1, reference_genome='GRCh38'),
            mt_alt.col_key].GT
        )
    ) | ( 
        hl.is_defined(mt_key[
            hl.locus(mt_alt.locus.contig, mt_alt.locus.position+2, reference_genome='GRCh38'),
            mt_alt.col_key].GT
        )
    ) | (
        hl.is_defined(mt_key[
            hl.locus(mt_alt.locus.contig, mt_alt.locus.position-1, reference_genome='GRCh38'),
            mt_alt.col_key].GT
        )
    ) | ( 
        hl.is_defined(mt_key[
            hl.locus(mt_alt.locus.contig, mt_alt.locus.position-2, reference_genome='GRCh38'),
            mt_alt.col_key].GT
        )
    )
)

# Annotate row
mt_anno = mt_mnp.annotate_rows(
    site_is_mnp = hl.agg.any(mt_mnp.gt_is_mnp),
    n_in_mnv = hl.agg.count_where(mt_mnp.gt_is_mnp)
)

# write the row hailTable with mnp annotation
ht_mnp = mt_anno.rows()
ht_mnp.write("mnp_sites.ht")


1 Like