THE HAIL ANNOTATION DATABASE
The Hail annotation database provides an easy way to obtain publicly available biological information to annotate your .vds. We currently include epigenetic annotations, gnomAD, predicted coding variant consequences, conservation scores, machine-learning-based deleteriousness scores, and gene-specific annotations, and will soon include results from genome-wide association studies.
Documentation and query builder
Here (https://hail.is/hail/annotationdb.html) you can find both the documentation for each annotation and a query builder. The query builder lets you interactively select the annotations you want to include and returns the exact script to run in Hail.
Bear in mind that we currently support only multi-allelic-split .vds files, so run split_multi() before annotating your data.
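The order of operations above (split first, then annotate) can be sketched as a small helper. This is a minimal sketch, not the script the query builder returns: it assumes a Hail 0.1 VariantDataset that exposes split_multi() and annotate_variants_db(), and the helper name, the bucket path, and the annotation name shown in the comments are illustrative assumptions.

```python
def split_and_annotate(vds, annotations):
    """Split multi-allelic variants, then annotate from the database.

    `vds` is assumed to be a Hail 0.1 VariantDataset; `annotations` is the
    list of annotation names produced by the query builder.
    """
    if not annotations:
        raise ValueError("select at least one annotation")
    # The database supports only multi-allelic-split datasets, so split first.
    split = vds.split_multi()
    # Assumption: annotate_variants_db takes a list of annotation names.
    return split.annotate_variants_db(list(annotations))


# Typical use on a cloud cluster (assumed names, not executed here):
#   from hail import HailContext
#   hc = HailContext()
#   vds = hc.read('gs://my-bucket/my.vds')             # hypothetical path
#   vds = split_and_annotate(vds, ['va.cadd.PHRED'])   # assumed annotation name
```

Keeping the split inside the helper makes it hard to forget; in practice, the script returned by the query builder should take precedence over this sketch.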
Data location
All the annotation material is stored in the cloud at gs://annotationdb, so the only way to use the database is to run Hail on the cloud. For more information, see the blog post Using Hail on the Google Cloud Platform. If you prefer to work in a Jupyter notebook, see the blog post Using Hail with Jupyter Notebooks on Google Cloud.
Each folder in gs://annotationdb contains a different annotation set. Typically, a folder contains a .tsv.bgz file with the raw data, a .kt (or .vds) file that is used directly for annotation, a .json file that documents the annotation, and a .py file with the code used to transform the .tsv.bgz file into the processed .kt (or .vds).
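To make the folder layout concrete, here is a minimal sketch that maps a file name to its role based only on the extensions listed above. The function name and the example file names are hypothetical; actual names inside gs://annotationdb may differ.

```python
# Suffix-to-role table, taken from the typical folder contents described above.
ROLE_BY_SUFFIX = [
    (".tsv.bgz", "raw data"),
    (".kt", "processed KeyTable used for annotation"),
    (".vds", "processed dataset used for annotation"),
    (".json", "documentation"),
    (".py", "conversion script"),
]


def file_role(name):
    """Return the role of a file in an annotation-set folder, or 'unknown'."""
    # Tolerate a trailing slash in case an entry is listed as a directory.
    name = name.rstrip("/")
    for suffix, role in ROLE_BY_SUFFIX:
        if name.endswith(suffix):
            return role
    return "unknown"


# Hypothetical listing of one annotation-set folder.
for entry in ["example.tsv.bgz", "example.kt/", "example.json", "example.py"]:
    print(entry, "->", file_role(entry))
```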
Annotation types
There are four main categories of annotations:
- Pre-computed variants: annotations that are pre-computed for each possible SNP in the genome. These are annotated directly from a .vds. We currently do not pre-compute values for INDELs, so INDELs will not be annotated.
- On-the-fly variants: annotations that are obtained by running an annotation program on the fly (we support VEP and Nirvana). Both SNPs and INDELs will be annotated.
- Intervals: annotations that are specific to certain genomic intervals. These are annotated directly from a KeyTable (.kt) and include most of the epigenomic annotations. Both SNPs and INDELs will be annotated.
- Gene: annotations that are specific to each gene. Gene annotations rely on a gene annotation already being present in the .vds. If it is not, VEP is run to obtain this information.
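The SNP/INDEL coverage of the first three categories can be summarized in a few lines. This is an illustrative sketch only: the category keys and function names are invented for the example, any variant that is not a single-base substitution is treated as an INDEL for simplicity, and the gene category is left out because its coverage depends on the gene annotation rather than on variant type.

```python
# Whether each category annotates INDELs, as stated in the list above.
# SNPs are annotated by all three categories.
COVERS_INDELS = {
    "pre-computed": False,  # values exist only for SNPs
    "on-the-fly": True,     # VEP or Nirvana is run on every variant
    "intervals": True,      # based on genomic intervals
}


def is_snp(ref, alt):
    """True for a single-base substitution (simplified definition)."""
    return len(ref) == 1 and len(alt) == 1 and ref != alt


def will_be_annotated(category, ref, alt):
    """Return True if a variant would receive annotations from `category`."""
    if category not in COVERS_INDELS:
        raise KeyError("unknown category: %s" % category)
    return is_snp(ref, alt) or COVERS_INDELS[category]


print(will_be_annotated("pre-computed", "A", "T"))   # SNP
print(will_be_annotated("pre-computed", "A", "AT"))  # insertion
print(will_be_annotated("intervals", "AT", "A"))     # deletion
```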
Requesting new annotations
We are planning to create an online form for users to suggest which annotations to include in this database. Until then, please contact Liam Labbot - labbott@broadinstitute.org or Andrea Ganna - aganna@broadinstitute.org to request new annotations.