I am trying to annotate my variants with EVE (https://evemodel.org/), a new evolution-based prediction. Just checking with other users, have any of you gotten success with annotating with it? I have downloaded the bulk data and there seems to be a lot of VCFs (for each gene), and I am not too sure how to proceed.
Hey @Paul_74 ,
I’ve never used EVE before, but I’d expect that you should be able to load all the vcfs like this:
import hail as hl
mt = hl.read_matrix_table('/path/to/your/data.mt')
eve = hl.import_vcf('/path/to/eve/*.vcf')
Then you’ll need a gene annotation. If you’re in one of our supported clouds (Google & Microsoft) you should be able to run this:
db = hl.experimental.DB(region='us', cloud='gcp')
mt = db.annotate_rows_db(mt, 'gencode')
Alternatively you can download the gencode dataset itself and import it into Hail.
Then you can use the gene annotation to load the eve data:
mt = mt.annotate_rows(eve = eve[mt.gencode.gene_name])
You might need to look at the other fields in mt.gencode
until you find the format / kind of gene name that EVE is using.
Hi @danking , I am the OP of this post but I forgot my password. I tried to do that, but it is giving me this error.
Error summary: HailException: Invalid locus '20:63361097' found. Position '63361097' is not within the range [1-63025520] for reference genome 'GRCh37'.
I understand that it may instead be GRCh38, but is the behavior to detect the reference genome version directly from the VCF? If so, do you know how to resolve or lift it over? Thank you.
Hail doesn’t try to detect reference genomes in VCFs, but you can explicitly specify it as:
hl.import_vcf(..., reference_genome='GRCh38')
I think you’ll also encounter the problem that GRCh38 decided to go with “chr” prefixes but a lot of folks elide those. You can fix those with contig_recoding
:
hl.import_vcf(
...,
contig_recoding = {
**{str(i): 'chr' + str(i) for i in range(1, 23)},
**{'X': 'chrX', 'Y': 'chrY', 'MT': 'chrM'}
},
reference_genome='GRCh38'
)
Is this error coming from importing the EVE data? That suggests that the EVE data is keyed by locus & alleles, not gene name, which means my suggestion above won’t work as expected. Can you share eve.describe()
and mt.describe()
?
I tried your method, and I faced another issue:
Error summary: HailException: Invalid locus 'CHR_HG2217_PATCH:119029357' found. Contig 'CHR_HG2217_PATCH' is not in the reference genome 'GRCh38'.
Indeed, EVE is keyed by locus and allele:
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'locus': locus<GRCh38>
'alleles': array<str>
'rsid': str
'qual': float64
'filters': set<str>
'info': struct {
EVE: float64,
EnsTranscript: str,
RevStr: str,
ProtMut: str,
Class10: str,
Class20: str,
Class25: str,
Class30: str,
Class40: str,
Class50: str,
Class60: str,
Class70: str,
Class75: str,
Class80: str,
Class90: str
}
----------------------------------------
Entry fields:
None
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
If this is the case, is it wiser to just pick the individual genes and annotate it? Thank you for your help; I am not too familiar with annotating with an external data set.
Funky. I have no idea which chromosome that’s supposed to be. hl.import_vcf
accepts skip_invalid_loci=True
to just ignore invalid loci.
Cool, that makes it even easier to annotate with EVE:
import hail as hl
mt = hl.read_matrix_table(...)
eve = hl.import_vcf(...)
mt = mt.annotate_rows(eve = eve.rows()[mt.row_key])
For your genotype data, you can work directly from VCF if you like, but you’ll find that iterative quality control and analysis of that data is faster coming directly from a Hail MatrixTable. You can make a copy in MT format like this:
import hail as hl
hl.import_vcf(....).write('/new/path/to/data.mt')
EDIT: Add missing .rows()
to the annotate_rows
line.
Thank you so much @danking , it seems to be working fine so far!
Sorry for another message. I managed to write the EVE MatrixTable down, and I am trying to annotate the rows now. However, it is showing me this error:
TypeError: MatrixTable.__getitem__: invalid index argument(s)
Usage 1: field selection: mt['field']
Usage 2: Entry joining: mt[mt2.row_key, mt2.col_key]
To join row or column fields, use one of the following:
rows:
mt.index_rows(mt2.row_key)
mt.rows().index(mt2.row_key)
mt.rows()[mt2.row_key]
cols:
mt.index_cols(mt2.col_key)
mt.cols().index(mt2.col_key)
mt.cols()[mt2.col_key]
I think I may know why, the alleles from EVE are showed like this:
chr1:961387 ["GCC","AAA"]
chr1:961387 ["GCC","AAC"]
chr1:961387 ["GCC","AAG"]
chr1:961387 ["GCC","AAT"]
chr1:961387 ["GCC","ACA"]
chr1:961387 ["GCC","ACC"]
chr1:961387 ["GCC","ACG"]
chr1:961387 ["GCC","ACT"]
chr1:961387 ["GCC","AGA"]
chr1:961387 ["GCC","AGC"]
chr1:961387 ["GCC","AGG"]
chr1:961387 ["GCC","AGT"]
chr1:961387 ["GCC","ATA"]
chr1:961387 ["GCC","ATC"]
If this is the case, it seems not to be possible to do so through the keys… Maybe I will just export it out and find a way to annotate it with brute force.
Ah, sorry. I forgot to index into the rows rather than the whole MT. Hail defaults to importing a VCF as a matrix table (i.e. a thing with genotypes) but your VCF has no samples. Hail is complaining that you’re trying to index the genotypes (a thing indexed by variants and samples) with just the variants. Try this:
mt = mt.annotate_rows(eve = eve.rows()[mt.row_key])
Thanks that works! However, with how EVE structured their alleles:
chr1:961387 ["GCC","ACT"]
chr1:961387 ["GCC","AGA"]
chr1:961387 ["GCC","AGC"]
chr1:961387 ["GCC","AGG"]
chr1:961387 ["GCC","AGT"]
chr1:961387 ["GCC","ATA"]
chr1:961387 ["GCC","ATC"]
I am not able to annotate my rows as my alleles (from GATK) are showed differently like this:
|chr1:17378|[C,T]|
|chr1:17379|[G,A]|
|chr1:17385|[G,A]|
|chr1:17398|[C,A]|
I may try to explore other methods, but thank you so much for your help!
My solution goes something like this:
mt = hl.vep(mt, "file:GRCh38_VEP.json")
mt = mt.annotate_rows(
protCons=mt.vep.transcript_consequences.amino_acids[0].split("/")[0]
+ hl.str(mt.vep.transcript_consequences.protein_end[0])
+ mt.vep.transcript_consequences.amino_acids[0].split("/")[-1]
)
eve = pd.read_csv("eve.csv")
eve.loc[:, "protCons"] = eve.wt_aa + eve.position.astype(str) + eve.mt_aa
### Put your dataframe in Hail
mt = mt.annotate_rows(EVE = eve[mt.protCons])
A little convoluted but easier than going through locus/alleles
(Also, if someone has an easier way to protein consequences that would be neat).
Thank you so much for the workaround! Can I ask where do you obtain the master eve.csv
? Did you combine all of the variant CSV files, or did the team provide a master file that I may have missed out on?