Annotating variants with EVE prediction

Paul_74 · June 16, 2022, 7:26pm

I am trying to annotate my variants with EVE (https://evemodel.org/), a new evolution-based prediction. Just checking with other users, have any of you gotten success with annotating with it? I have downloaded the bulk data and there seems to be a lot of VCFs (for each gene), and I am not too sure how to proceed.

danking · June 17, 2022, 2:37pm

Hey @Paul_74 ,

I’ve never used EVE before, but I’d expect that you should be able to load all the vcfs like this:

import hail as hl
mt = hl.read_matrix_table('/path/to/your/data.mt')
eve = hl.import_vcf('/path/to/eve/*.vcf')

Then you’ll need a gene annotation. If you’re in one of our supported clouds (Google & Microsoft) you should be able to run this:

db = hl.experimental.DB(region='us', cloud='gcp')
mt = db.annotate_rows_db(mt, 'gencode')

Alternatively you can download the gencode dataset itself and import it into Hail.

Then you can use the gene annotation to load the eve data:

mt = mt.annotate_rows(eve = eve[mt.gencode.gene_name])

You might need to look at the other fields in mt.gencode until you find the format / kind of gene name that EVE is using.

paul_74_2 · June 17, 2022, 3:36pm

Hi @danking , I am the OP of this post but I forgot my password. I tried to do that, but it is giving me this error.
Error summary: HailException: Invalid locus '20:63361097' found. Position '63361097' is not within the range [1-63025520] for reference genome 'GRCh37'.
I understand that it may instead be GRCh38, but is the behavior to detect the reference genome version directly from the VCF? If so, do you know how to resolve or lift it over? Thank you.

danking · June 17, 2022, 3:56pm

Hail doesn’t try to detect reference genomes in VCFs, but you can explicitly specify it as:

hl.import_vcf(..., reference_genome='GRCh38')

I think you’ll also encounter the problem that GRCh38 decided to go with “chr” prefixes but a lot of folks elide those. You can fix those with contig_recoding:

hl.import_vcf(
    ...,
    contig_recoding = {
        **{str(i): 'chr' + str(i) for i in range(1, 23)},
        **{'X': 'chrX', 'Y': 'chrY', 'MT': 'chrM'}
    },
    reference_genome='GRCh38'
)

Is this error coming from importing the EVE data? That suggests that the EVE data is keyed by locus & alleles, not gene name, which means my suggestion above won’t work as expected. Can you share eve.describe() and mt.describe()?

paul_74_2 · June 17, 2022, 4:01pm

I tried your method, and I faced another issue:

Error summary: HailException: Invalid locus 'CHR_HG2217_PATCH:119029357' found. Contig 'CHR_HG2217_PATCH' is not in the reference genome 'GRCh38'.

Indeed, EVE is keyed by locus and allele:

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        EVE: float64, 
        EnsTranscript: str, 
        RevStr: str, 
        ProtMut: str, 
        Class10: str, 
        Class20: str, 
        Class25: str, 
        Class30: str, 
        Class40: str, 
        Class50: str, 
        Class60: str, 
        Class70: str, 
        Class75: str, 
        Class80: str, 
        Class90: str
    }
----------------------------------------
Entry fields:
    None
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

If this is the case, is it wiser to just pick the individual genes and annotate it? Thank you for your help; I am not too familiar with annotating with an external data set.

danking · June 17, 2022, 4:17pm

Funky. I have no idea which chromosome that’s supposed to be. hl.import_vcf accepts skip_invalid_loci=True to just ignore invalid loci.

Cool, that makes it even easier to annotate with EVE:

import hail as hl
mt = hl.read_matrix_table(...)
eve = hl.import_vcf(...)
mt = mt.annotate_rows(eve = eve.rows()[mt.row_key])

For your genotype data, you can work directly from VCF if you like, but you’ll find that iterative quality control and analysis of that data is faster coming directly from a Hail MatrixTable. You can make a copy in MT format like this:

import hail as hl
hl.import_vcf(....).write('/new/path/to/data.mt')

EDIT: Add missing .rows() to the annotate_rows line.

paul_74_2 · June 17, 2022, 4:54pm

Thank you so much @danking , it seems to be working fine so far!

paul_74_2 · June 17, 2022, 5:12pm

Sorry for another message. I managed to write the EVE MatrixTable down, and I am trying to annotate the rows now. However, it is showing me this error:

TypeError: MatrixTable.__getitem__: invalid index argument(s)
  Usage 1: field selection: mt['field']
  Usage 2: Entry joining: mt[mt2.row_key, mt2.col_key]

  To join row or column fields, use one of the following:
    rows:
       mt.index_rows(mt2.row_key)
       mt.rows().index(mt2.row_key)
       mt.rows()[mt2.row_key]
    cols:
       mt.index_cols(mt2.col_key)
       mt.cols().index(mt2.col_key)
       mt.cols()[mt2.col_key]

I think I may know why, the alleles from EVE are showed like this:

chr1:961387	["GCC","AAA"]
chr1:961387	["GCC","AAC"]
chr1:961387	["GCC","AAG"]
chr1:961387	["GCC","AAT"]
chr1:961387	["GCC","ACA"]
chr1:961387	["GCC","ACC"]
chr1:961387	["GCC","ACG"]
chr1:961387	["GCC","ACT"]
chr1:961387	["GCC","AGA"]
chr1:961387	["GCC","AGC"]
chr1:961387	["GCC","AGG"]
chr1:961387	["GCC","AGT"]
chr1:961387	["GCC","ATA"]
chr1:961387	["GCC","ATC"]

If this is the case, it seems not to be possible to do so through the keys… Maybe I will just export it out and find a way to annotate it with brute force.

danking · June 17, 2022, 5:28pm

Ah, sorry. I forgot to index into the rows rather than the whole MT. Hail defaults to importing a VCF as a matrix table (i.e. a thing with genotypes) but your VCF has no samples. Hail is complaining that you’re trying to index the genotypes (a thing indexed by variants and samples) with just the variants. Try this:

mt = mt.annotate_rows(eve = eve.rows()[mt.row_key])

paul_74_2 · June 17, 2022, 5:38pm

Thanks that works! However, with how EVE structured their alleles:

chr1:961387	["GCC","ACT"]
chr1:961387	["GCC","AGA"]
chr1:961387	["GCC","AGC"]
chr1:961387	["GCC","AGG"]
chr1:961387	["GCC","AGT"]
chr1:961387	["GCC","ATA"]
chr1:961387	["GCC","ATC"]

I am not able to annotate my rows as my alleles (from GATK) are showed differently like this:

|chr1:17378|[C,T]|
|chr1:17379|[G,A]|
|chr1:17385|[G,A]|
|chr1:17398|[C,A]|

I may try to explore other methods, but thank you so much for your help!

jsmadsen · June 20, 2022, 11:41pm

My solution goes something like this:


mt = hl.vep(mt, "file:GRCh38_VEP.json")

mt = mt.annotate_rows(
    protCons=mt.vep.transcript_consequences.amino_acids[0].split("/")[0]
    + hl.str(mt.vep.transcript_consequences.protein_end[0])
    + mt.vep.transcript_consequences.amino_acids[0].split("/")[-1]
)

eve = pd.read_csv("eve.csv")
eve.loc[:, "protCons"] = eve.wt_aa + eve.position.astype(str) + eve.mt_aa 

### Put your dataframe in Hail

mt = mt.annotate_rows(EVE = eve[mt.protCons])

A little convoluted but easier than going through locus/alleles (Also, if someone has an easier way to protein consequences that would be neat).

paul_74_2 · June 22, 2022, 3:30pm

Thank you so much for the workaround! Can I ask where do you obtain the master eve.csv? Did you combine all of the variant CSV files, or did the team provide a master file that I may have missed out on?

Topic		Replies	Views
Annotate variants with genes Hail Query & hailctl	9	484	May 22, 2023
Best strategy for annotating and filtering VCF files using HAIL-VEP on UKB RAP? Hail Query & hailctl	3	1513	June 17, 2022
Import existing VEP annotations from vcf or CSQ Hail Query & hailctl	10	1521	November 27, 2019
Export VCF taking a long time, even when running in parallel Hail Query & hailctl	3	455	December 5, 2023
Running VEP in Hail with a specific version of GENCODE Hail Query & hailctl	7	477	March 31, 2023

Annotating variants with EVE prediction

Related topics