Using annotation datasets locally

Hi folks,

Is there any way to download some of the datasets used by Hail's load_dataset() function? For example, I'd like GTEx_RNA_seq_gene_TPMs and dbNSFP_variants in Hail-friendly format, to use on a local Spark cluster.


You could download them using gsutil, but you’ll pay egress costs. This Google documentation explains how to download files from “requester pays” buckets. The per-gigabyte cost to download this data depends on where in the world you are.

You can estimate the size of a dataset with gsutil du -sh gs://.... That operation has a small cost associated with it. dbNSFP is about 100 GB in size.
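
To make the steps above concrete, here is a minimal Python sketch that builds the gsutil commands and does the cost arithmetic. It assumes gsutil is installed and that you have a Google Cloud project to bill, since requester-pays buckets need one (passed with gsutil's -u option). The helper names, "my-billing-project", and the local directory are made up for illustration, and the price per gigabyte is left as a parameter because it depends on your location.

```python
import shlex


def gsutil_du_command(url, billing_project):
    """Build a `gsutil du` call that reports the total size under `url`."""
    # -u names the project billed for access to a requester-pays bucket.
    # -s summarises to one total, -h prints it in human-readable units.
    return ["gsutil", "-u", billing_project, "du", "-s", "-h", url]


def gsutil_copy_command(url, local_dir, billing_project):
    """Build a recursive, parallel `gsutil cp` call into `local_dir`."""
    # -m runs the copy in parallel; cp -r recurses into the table directory.
    return ["gsutil", "-u", billing_project, "-m", "cp", "-r", url, local_dir]


def estimate_egress_cost(size_gb, price_per_gb):
    """Rough egress cost: size times your region's per-gigabyte price."""
    return size_gb * price_per_gb


if __name__ == "__main__":
    # URL prefix as listed in this thread; substitute the full dataset path.
    url = "gs://hail-datasets-us/annotations/"
    project = "my-billing-project"  # hypothetical project ID
    print(shlex.join(gsutil_du_command(url, project)))
    print(shlex.join(gsutil_copy_command(url, "./hail-datasets/", project)))
```

For dbNSFP at roughly 100 GB, estimate_egress_cost(100, price_per_gb) gives a ballpark figure once you look up the rate for your region. The command lists can be passed directly to subprocess.run if you'd rather not paste them into a shell.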

These are the URLs for dbNSFP_variants:

      {
        "reference_genome": "GRCh37",
        "url": {
          "aws": {
            "us": "s3://hail-datasets-us-east-1/annotations/"
          },
          "gcp": {
            "eu": "gs://hail-datasets-eu/annotations/",
            "us": "gs://hail-datasets-us/annotations/"
          }
        },
        "version": "4.0"
      },
      {
        "reference_genome": "GRCh38",
        "url": {
          "aws": {
            "us": "s3://hail-datasets-us-east-1/annotations/"
          },
          "gcp": {
            "eu": "gs://hail-datasets-eu/annotations/",
            "us": "gs://hail-datasets-us/annotations/"
          }
        },
        "version": "4.0"
      }

and GTEx:

      {
        "reference_genome": "GRCh37",
        "url": {
          "aws": {
            "us": "s3://hail-datasets-us-east-1/"
          },
          "gcp": {
            "eu": "gs://hail-datasets-eu/",
            "us": "gs://hail-datasets-us/"
          }
        },
        "version": "v7"
      }
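
Since each entry is keyed by reference genome, cloud, and region, picking the right URL can be scripted. The sketch below is only an illustration: the entries are copied from this thread (with the URL prefixes exactly as shown), and find_url is a hypothetical helper, not part of Hail.

```python
# dbNSFP_variants entries as listed in this thread.
DBNSFP_VERSIONS = [
    {
        "reference_genome": "GRCh37",
        "url": {
            "aws": {"us": "s3://hail-datasets-us-east-1/annotations/"},
            "gcp": {
                "eu": "gs://hail-datasets-eu/annotations/",
                "us": "gs://hail-datasets-us/annotations/",
            },
        },
        "version": "4.0",
    },
    {
        "reference_genome": "GRCh38",
        "url": {
            "aws": {"us": "s3://hail-datasets-us-east-1/annotations/"},
            "gcp": {
                "eu": "gs://hail-datasets-eu/annotations/",
                "us": "gs://hail-datasets-us/annotations/",
            },
        },
        "version": "4.0",
    },
]


def find_url(versions, reference_genome, cloud, region, version=None):
    """Return the URL of the first entry matching the requested build and location."""
    for entry in versions:
        if entry["reference_genome"] != reference_genome:
            continue
        if version is not None and entry["version"] != version:
            continue
        # Not every cloud has every region (AWS lists only "us" here).
        regions = entry["url"].get(cloud, {})
        if region in regions:
            return regions[region]
    raise KeyError(f"no URL for {reference_genome} on {cloud}/{region}")


if __name__ == "__main__":
    print(find_url(DBNSFP_VERSIONS, "GRCh38", "gcp", "eu"))
```

Choosing the region closest to you is usually worthwhile, because the per-gigabyte download cost depends on where you are.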

In general, this information is stored here:


Hi @danking,

Many thanks for the info. I’ll have a look.