Ah, yes, this is a bit tricky! When going from variant-keyed data to gene-keyed data, we first need to duplicate each row once (and only once) for each gene in which it appears. Unfortunately, the VEP transcript array has an element for each transcript in which a variant appears. If a transcript appears in two variants in gene A and three variants in gene B, we’ll double and triple count it, respectively!
I’ve included the full code at the end of this post, but I’ll step through it piecewise.
First we define a set of the LoF terms.
lof_terms = hl.set(['start_lost', 'splice_acceptor_variant',
'splice_donor_variant', 'stop_gained',
'frameshift_variant'])
None of the questions you’ve asked reference the genotypes, so let’s ignore everything other than the row (variant) metadata:
variants = mt.rows()
For each variant, create a dictionary whose keys are the genes and whose values are the list of transcripts in that gene in which this variant appears:
variants = variants.annotate(
transcript_consequences_by_gene =
variants.vep.transcript_consequences.group_by(lambda x: x.gene_symbol)
)
Create a variant annotation that is the set of all genes in which this variant appears and duplicate the variant once per gene. The name='gene'
bit means: after duplicating the rows, remove the gene_set
field and create a new filed called gene
which is just the gene for this row.
variants = variants.annotate(
gene_set = variants.transcript_consequences_by_gene.key_set()
)
variants = variants.explode(variants.gene_set, name='gene')
Recall that each row corresponds to a variant in a particular gene. We look up the list of transcripts for this variant in this gene and save that as a new annotation called transcripts_for_just_this_gene
.
variants = variants.annotate(
transcripts_for_just_this_gene =
variants.vep.transcript_consequences_by_gene(variants.gene)
)
Check if any transcript has any consequence term that means LoF:
variants = variants.annotate(
is_lof = variants.transcripts_for_just_this_gene.any(lambda transcript:
transcript.consequence_terms.any(lambda term:
lof_terms.contains(term)
)
)
)
Create one row per gene by counting the number of LoF variants in that gene.
variants = variants.group_by(
variants.gene
).aggregate(
number_of_non_ref_variants = hl.agg.count_where(variants.is_lof)
)
Show the first couple rows:
variants.is_lof.show()
import hail as hl
mt = hl.read_matrix_table('gs://...')
lof_terms = hl.set(['start_lost', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained', 'frameshift_variant'])
variants = mt.rows()
variants = variants.annotate(
transcript_consequences_by_gene =
variants.vep.transcript_consequences.group_by(lambda x: x.gene_symbol)
)
variants = variants.annotate(
gene_set = variants.transcript_consequences_by_gene.key_set()
)
variants = variants.explode(variants.gene_set, name='gene')
variants = variants.annotate(
transcripts_for_just_this_gene =
variants.vep.transcript_consequences_by_gene(variants.gene)
)
variants = variants.annotate(
is_lof = variants.transcripts_for_just_this_gene.any(lambda transcript:
transcript.consequence_terms.any(lambda term:
lof_terms.contains(term)
)
)
)
variants1 = variants.group_by(
variants.gene
).aggregate(
number_of_non_ref_variants = hl.agg.count_where(variants.is_lof)
)
variants.is_lof.show()