Annotate variants using amino acid position intervals


#1

Hi, I’d like to annotate variants based on amino acid coordinates, for example if I have the following table

AA_start AA_end Gene Domain
179 191 SCN5A Cytoplasmic
275 401 SCN2A Pore-loop

I’d like to annotate variants with ‘Domain’, if the amino acid changed (e.g. va.vep.transcript_consequences.protein_start) falls within AA_start and AA_end and ‘Gene’ matches that in the variant annotation.

If you need any more info, please let me know!

Cheers,
Elliott


#2

You can achieve this with a KeyTable.

Suppose you have some vds containing VEP and gene annotations and a KeyTable named domain with the schema you depicted in the table.

# ensure domain is keyed appropriately
domain = domain.key_by(['AA_start', 'AA_end', 'gene']
# extract the variants from a VDS
variants = vds.variant_table()

domain_annotated_variants = (variants.annotate([
  'AA_start = va.vep.transcript_consequences.protein_start',
  'AA_end = va.vep.transcript_consequences.protein_end',
  'gene = va.gene']) 
  # remove irrelevant data
  .drop('va')
  # use the same keys as the domain key table
  .key_by(['AA_start', 'AA_end', 'gene'])
  # cross-reference the two tables with join
  .join(domain)
  # drop irrelevant data again
  .drop(['AA_start', 'AA_end', 'gene'])
)

Now the schema of domain_annotated_variants is: { v: Variant, domain: String }. It’s now easy to bring the domain annotation into the original VDS:

domain_annotated_vds = vds.annotate_variants_table(domain_annotated_variants.key_by('v'))

This extends your variant annotation schema with one field: domain of type String.

NB: I have assumed that there is exactly one domain for each amino-acid-coordinate and gene pair. If that’s not the case you can pass product = True to annotate_variants_table. This changes the type of the variant annotation domain to Array[String] and it now contains every matching domain, instead of an arbitrary matching domain.