How do I count LoFs per gene?

Hi,

I want to calculate the number of LoF variants per gene! I have a MatrixTable already annotated with VEP. The VEP consequences that I consider loss of function are:

'start_lost', 'splice_acceptor_variant', 'splice_donor_variant',
'stop_gained', 'frameshift_variant'

In particular, I am having trouble:

  1. going from a variant-keyed dataset to a gene-keyed dataset.
  2. not double counting a variant if it appears in two transcripts of the same gene!

Ah, yes, this is a bit tricky! When going from variant-keyed data to gene-keyed data, we first need to duplicate each row once (and only once) for each gene in which it appears. Unfortunately, the VEP transcript array has an element for each transcript in which a variant appears. If a transcript appears in two variants in gene A and three variants in gene B, we’ll double and triple count it, respectively!

I’ve included the full code at the end of this post, but I’ll step through it piecewise.

First we define a set of the LoF terms.

lof_terms = hl.set(['start_lost', 'splice_acceptor_variant',
                    'splice_donor_variant', 'stop_gained',
                    'frameshift_variant'])

None of the questions you’ve asked reference the genotypes, so let’s ignore everything other than the row (variant) metadata:

variants = mt.rows()

For each variant, create a dictionary whose keys are the genes and whose values are the list of transcripts in that gene in which this variant appears:

variants = variants.annotate(
   transcript_consequences_by_gene =
     variants.vep.transcript_consequences.group_by(lambda x: x.gene_symbol)
)

Create a variant annotation that is the set of all genes in which this variant appears and duplicate the variant once per gene. The name='gene' bit means: after duplicating the rows, remove the gene_set field and create a new filed called gene which is just the gene for this row.

variants = variants.annotate(
   gene_set = variants.transcript_consequences_by_gene.key_set()
)
variants = variants.explode(variants.gene_set, name='gene')

Recall that each row corresponds to a variant in a particular gene. We look up the list of transcripts for this variant in this gene and save that as a new annotation called transcripts_for_just_this_gene.

variants = variants.annotate(
   transcripts_for_just_this_gene =
     variants.vep.transcript_consequences_by_gene(variants.gene)
)

Check if any transcript has any consequence term that means LoF:

variants = variants.annotate(
    is_lof = variants.transcripts_for_just_this_gene.any(lambda transcript:
            transcript.consequence_terms.any(lambda term:
               lof_terms.contains(term)
        )
    )
)

Create one row per gene by counting the number of LoF variants in that gene.

variants = variants.group_by(
    variants.gene
).aggregate(
    number_of_non_ref_variants = hl.agg.count_where(variants.is_lof)
)

Show the first couple rows:

variants.is_lof.show()
import hail as hl

mt = hl.read_matrix_table('gs://...')
lof_terms = hl.set(['start_lost', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained', 'frameshift_variant'])

variants = mt.rows()

variants = variants.annotate(
   transcript_consequences_by_gene =
     variants.vep.transcript_consequences.group_by(lambda x: x.gene_symbol)
)
variants = variants.annotate(
   gene_set = variants.transcript_consequences_by_gene.key_set()
)
variants = variants.explode(variants.gene_set, name='gene')
variants = variants.annotate(
   transcripts_for_just_this_gene =
     variants.vep.transcript_consequences_by_gene(variants.gene)
)
variants = variants.annotate(
    is_lof = variants.transcripts_for_just_this_gene.any(lambda transcript:
            transcript.consequence_terms.any(lambda term:
            lof_terms.contains(term)
        )
    )
)

variants1 = variants.group_by(
    variants.gene
).aggregate(
    number_of_non_ref_variants = hl.agg.count_where(variants.is_lof)
)

variants.is_lof.show()
2 Likes