The annotation database

THE HAIL ANNOTATION DATABASE

The Hail annotation database provides an easy way to obtain publicly available biological information to annotate your .vds. We currently include epigenetic annotations, gnomAD, predicted coding variant consequences, conservation scores, machine-learning-based deleteriousness scores, and gene-specific annotations, and will soon include results from genome-wide association studies.

Documentation and query builder

Here (https://hail.is/hail/annotationdb.html) you can find both the documentation for each annotation and a query builder. The query builder lets you interactively select the annotations you want to include and returns the exact script to run in Hail.

Bear in mind that we currently support only .vds files in which multi-allelic variants have been split, so run split_multi() before annotating.
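To make the splitting requirement concrete, here is a minimal plain-Python sketch of what "multi-allelic split" means for a single variant record. It is not Hail's implementation: the Variant type and the split_multiallelic function are hypothetical names used only for illustration, and the sketch ignores genotypes, which a real split also has to handle.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical, simplified variant record (no genotypes)."""
    contig: str
    pos: int
    ref: str
    alts: List[str]


def split_multiallelic(variant: Variant) -> List[Variant]:
    """Return one biallelic record per alternate allele."""
    return [
        Variant(variant.contig, variant.pos, variant.ref, [alt])
        for alt in variant.alts
    ]


# A site with two alternate alleles becomes two biallelic records.
multi = Variant("1", 1000, "A", ["C", "T"])
split = split_multiallelic(multi)
for record in split:
    print(record)
```

After splitting, every record carries exactly one alternate allele, which is the form the annotation database expects.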

Data location

All the annotation material is stored in the cloud at gs://annotationdb, so the only way to use the database is to run Hail on the Google Cloud Platform. For more information, see the blog post Using Hail on the Google Cloud Platform. If you prefer to work in a Jupyter notebook, see the blog post Using Hail with Jupyter Notebooks on Google Cloud.

Each folder in gs://annotationdb contains a different annotation set. Typically, a folder contains a .tsv.bgz file with the raw data, a .kt (or .vds) file that is used directly for annotation, a .json file that documents the annotation, and a .py file with the code used to transform the .tsv.bgz file into the processed .kt (or .vds).
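As a quick reference, the folder layout described above can be sketched as a mapping from file extension to role. This is an illustrative sketch only: the file names in example_files are hypothetical and do not refer to actual files in the bucket.

```python
# Role of each file type in an annotation folder, as described above.
ROLES = {
    ".tsv.bgz": "raw data",
    ".kt": "processed KeyTable used for annotation",
    ".vds": "processed dataset used for annotation",
    ".json": "documentation",
    ".py": "conversion script (.tsv.bgz to .kt or .vds)",
}


def role_of(filename: str) -> str:
    """Return the role of a file based on its extension."""
    for suffix, role in ROLES.items():
        if filename.endswith(suffix):
            return role
    return "unknown"


# Hypothetical file names, for illustration only.
example_files = ["scores.tsv.bgz", "scores.kt", "scores.json", "scores.py"]
for name in example_files:
    print(name, "->", role_of(name))
```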

Annotation types

There are four main categories of annotations:

  • Pre-computed variants: annotations that are pre-computed for each possible SNP in the genome. These are annotated directly from a .vds. We do not currently pre-compute values for INDELs, so INDELs will not be annotated.
  • On-the-fly variants: annotations obtained by running an annotation program on the fly (we support VEP and Nirvana). Both SNPs and INDELs will be annotated.
  • Intervals: annotations that are specific to certain genomic intervals. These are annotated directly from a KeyTable (.kt) and include most of the epigenomic annotations. Both SNPs and INDELs will be annotated.
  • Gene: annotations that are specific to each gene. Gene annotations rely on a gene annotation already being present in the .vds. If it is not, VEP is run to obtain this information.
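The difference between pre-computed and interval annotations can be sketched in plain Python. This is a toy illustration under stated assumptions, not how Hail performs the annotation: the PRECOMPUTED and INTERVALS tables, their values, and the annotate function are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical pre-computed table: one score per possible SNP,
# keyed by (contig, position, ref, alt).
PRECOMPUTED: Dict[Tuple[str, int, str, str], float] = {
    ("1", 100, "A", "G"): 12.5,
}

# Hypothetical interval table: (contig, start, end inclusive, label).
INTERVALS: List[Tuple[str, int, int, str]] = [
    ("1", 50, 150, "enhancer"),
]


def is_snp(ref: str, alt: str) -> bool:
    """A SNP replaces exactly one base with one base."""
    return len(ref) == 1 and len(alt) == 1


def annotate(contig: str, pos: int, ref: str, alt: str) -> dict:
    """Annotate one biallelic variant from both tables."""
    # Pre-computed values exist only for SNPs, so INDELs get no score.
    score: Optional[float] = None
    if is_snp(ref, alt):
        score = PRECOMPUTED.get((contig, pos, ref, alt))
    # Interval annotations depend only on position, so they apply
    # to SNPs and INDELs alike.
    labels = [
        label
        for c, start, end, label in INTERVALS
        if c == contig and start <= pos <= end
    ]
    return {"score": score, "intervals": labels}


snp = annotate("1", 100, "A", "G")
indel = annotate("1", 100, "A", "AGG")
print(snp)
print(indel)
```

In this sketch the SNP receives both a score and an interval label, while the INDEL at the same position receives only the interval label.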

Requesting new annotations

We are planning to create an online form for users to suggest which annotations to include in this database. Until then, please contact Liam Labbot - labbott@broadinstitute.org or Andrea Ganna - aganna@broadinstitute.org to request new annotations.

Hi there,

I would like to annotate our VCFs with the updated gnomAD pLI scores, but I am not sure if I will be able to get access to a Google account with billing capabilities (re: the blog post on Using Hail on Google Cloud). I also read the post on Using Hail with Jupyter Notebooks on Google Cloud, but I am on a Windows machine, so it does not look like this would work for me (?). Are there any future plans to provide these annotation capabilities outside of the Google Cloud Platform?

Thanks!
-Parkes

Hi Parkes,

Currently, the annotation database is only available on Google Cloud Platform. In the future, we intend to port it over to other cloud platforms, such as Amazon Web Services, and perhaps also provide an offline version, but we don’t have a timeline for that yet.

Liam

Hi Liam,

Thanks for the response. Is there any way to pull down just the pLI scores so that we can use our own scripts to annotate our VCFs with them?

Thanks!

Best,
Parkes