Hi, I’d like to annotate variants based on amino acid coordinates, for example if I have the following table
AA_start
AA_end
Gene
Domain
179
191
SCN5A
Cytoplasmic
275
401
SCN2A
Pore-loop
I’d like to annotate variants with ‘Domain’, if the amino acid changed (e.g. va.vep.transcript_consequences.protein_start) falls within AA_start and AA_end and ‘Gene’ matches that in the variant annotation.
Suppose you have some vds containing VEP and gene annotations and a KeyTable named domain with the schema you depicted in the table.
# ensure domain is keyed appropriately
domain = domain.key_by(['AA_start', 'AA_end', 'gene']
# extract the variants from a VDS
variants = vds.variant_table()
domain_annotated_variants = (variants.annotate([
'AA_start = va.vep.transcript_consequences.protein_start',
'AA_end = va.vep.transcript_consequences.protein_end',
'gene = va.gene'])
# remove irrelevant data
.drop('va')
# use the same keys as the domain key table
.key_by(['AA_start', 'AA_end', 'gene'])
# cross-reference the two tables with join
.join(domain)
# drop irrelevant data again
.drop(['AA_start', 'AA_end', 'gene'])
)
Now the schema of domain_annotated_variants is: { v: Variant, domain: String }. It’s now easy to bring the domain annotation into the original VDS:
This extends your variant annotation schema with one field: domain of type String.
NB: I have assumed that there is exactly one domain for each amino-acid-coordinate and gene pair. If that’s not the case you can pass product = True to annotate_variants_table. This changes the type of the variant annotation domain to Array[String] and it now contains every matching domain, instead of an arbitrary matching domain.