How do I count LoFs per gene?

danking · March 2, 2020, 7:36pm

Hi,

I want to calculate the number of LoF variants per gene! I have a MatrixTable already annotated with VEP. The VEP consequences that I consider loss of function are:

'start_lost', 'splice_acceptor_variant', 'splice_donor_variant',
'stop_gained', 'frameshift_variant'

In particular, I am having trouble:

going from a variant-keyed dataset to a gene-keyed dataset.
not double counting a variant if it appears in two transcripts of the same gene!

danking · March 2, 2020, 7:47pm

Ah, yes, this is a bit tricky! When going from variant-keyed data to gene-keyed data, we first need to duplicate each row once (and only once) for each gene in which it appears. Unfortunately, the VEP transcript array has an element for each transcript in which a variant appears. If a transcript appears in two variants in gene A and three variants in gene B, we’ll double and triple count it, respectively!

I’ve included the full code at the end of this post, but I’ll step through it piecewise.

First we define a set of the LoF terms.

lof_terms = hl.set(['start_lost', 'splice_acceptor_variant',
                    'splice_donor_variant', 'stop_gained',
                    'frameshift_variant'])

None of the questions you’ve asked reference the genotypes, so let’s ignore everything other than the row (variant) metadata:

variants = mt.rows()

For each variant, create a dictionary whose keys are the genes and whose values are the list of transcripts in that gene in which this variant appears:

variants = variants.annotate(
   transcript_consequences_by_gene =
     variants.vep.transcript_consequences.group_by(lambda x: x.gene_symbol)
)

Create a variant annotation that is the set of all genes in which this variant appears and duplicate the variant once per gene. The name='gene' bit means: after duplicating the rows, remove the gene_set field and create a new filed called gene which is just the gene for this row.

variants = variants.annotate(
   gene_set = variants.transcript_consequences_by_gene.key_set()
)
variants = variants.explode(variants.gene_set, name='gene')

Recall that each row corresponds to a variant in a particular gene. We look up the list of transcripts for this variant in this gene and save that as a new annotation called transcripts_for_just_this_gene.

variants = variants.annotate(
   transcripts_for_just_this_gene =
     variants.vep.transcript_consequences_by_gene(variants.gene)
)

Check if any transcript has any consequence term that means LoF:

variants = variants.annotate(
    is_lof = variants.transcripts_for_just_this_gene.any(lambda transcript:
            transcript.consequence_terms.any(lambda term:
               lof_terms.contains(term)
        )
    )
)

Create one row per gene by counting the number of LoF variants in that gene.

variants = variants.group_by(
    variants.gene
).aggregate(
    number_of_non_ref_variants = hl.agg.count_where(variants.is_lof)
)

Show the first couple rows:

variants.is_lof.show()

import hail as hl

mt = hl.read_matrix_table('gs://...')
lof_terms = hl.set(['start_lost', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained', 'frameshift_variant'])

variants = mt.rows()

variants = variants.annotate(
   transcript_consequences_by_gene =
     variants.vep.transcript_consequences.group_by(lambda x: x.gene_symbol)
)
variants = variants.annotate(
   gene_set = variants.transcript_consequences_by_gene.key_set()
)
variants = variants.explode(variants.gene_set, name='gene')
variants = variants.annotate(
   transcripts_for_just_this_gene =
     variants.vep.transcript_consequences_by_gene(variants.gene)
)
variants = variants.annotate(
    is_lof = variants.transcripts_for_just_this_gene.any(lambda transcript:
            transcript.consequence_terms.any(lambda term:
            lof_terms.contains(term)
        )
    )
)

variants1 = variants.group_by(
    variants.gene
).aggregate(
    number_of_non_ref_variants = hl.agg.count_where(variants.is_lof)
)

variants.is_lof.show()

Topic		Replies	Views
Calculating the number of variants per column category Hail Query & hailctl	10	658	November 17, 2020
How can one get a list (set) of unique values from an annotation Help [0.1]	4	816	October 12, 2017
Most severe VEP consequence Hail Query & hailctl	3	4	July 3, 2025
Annotate variants with genes Hail Query & hailctl	9	482	May 22, 2023
Annotating nested structs based on the struct field value Hail Query & hailctl	0	309	August 21, 2023

How do I count LoFs per gene?

Related topics