Using annotation datasets locally

Hi folks,

Is there any way to download some of the datasets used by Hail's load_dataset() function to a local machine? For example, GTEx_RNA_seq_gene_TPMs and dbNSFP_variants, in a Hail-friendly format, for use on a local Spark cluster.

Regards!

You could download them using gsutil, but you’ll pay egress costs. Google's documentation on “requester pays” buckets explains how to download files from them. The per-gigabyte cost to download this data depends on where in the world you are.

You can estimate the size of a dataset with gsutil du -sh gs://.... That operation has a small cost associated with it. dbNSFP is about 100 GB in size.
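To make the two gsutil operations mentioned above a bit more concrete, here is a minimal sketch that only builds the command lines. It assumes gsutil is installed, and `my-billing-project` is a placeholder for a Google Cloud project of your own that gets billed for the requester-pays access (passed via gsutil's `-u` option). The helper name `gsutil_command` and the local path are hypothetical.

```python
import shlex


def gsutil_command(action, bucket_url, billing_project, local_dir=None):
    """Build a gsutil argument list for a requester-pays bucket.

    action: "du" to estimate the size, "cp" to download recursively.
    billing_project: the project billed for the access (placeholder value).
    """
    if action == "du":
        # -s summarises the total, -h prints human-readable sizes
        return ["gsutil", "-u", billing_project, "du", "-sh", bucket_url]
    if action == "cp":
        if local_dir is None:
            raise ValueError("local_dir is required for 'cp'")
        # -m copies in parallel; -r recurses, because .ht and .mt are directories
        return ["gsutil", "-m", "-u", billing_project,
                "cp", "-r", bucket_url, local_dir]
    raise ValueError(f"unsupported action: {action!r}")


if __name__ == "__main__":
    url = "gs://hail-datasets-us/annotations/dbnsfp4.0a.GRCh37.ht"
    # Print the commands instead of running them, since running them costs money.
    print(shlex.join(gsutil_command("du", url, "my-billing-project")))
    print(shlex.join(gsutil_command("cp", url, "my-billing-project", "/data/hail")))
```

Once a copy is on local disk, a `.ht` directory should be readable with hl.read_table and a `.mt` directory with hl.read_matrix_table, pointing at the local path.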

These are the URLs for dbNSFP_variants:

[
      {
        "reference_genome": "GRCh37",
        "url": {
          "aws": {
            "us": "s3://hail-datasets-us-east-1/annotations/dbnsfp4.0a.GRCh37.ht"
          },
          "gcp": {
            "eu": "gs://hail-datasets-eu/annotations/dbnsfp4.0a.GRCh37.ht",
            "us": "gs://hail-datasets-us/annotations/dbnsfp4.0a.GRCh37.ht"
          }
        },
        "version": "4.0"
      },
      {
        "reference_genome": "GRCh38",
        "url": {
          "aws": {
            "us": "s3://hail-datasets-us-east-1/annotations/dbnsfp4.0a.GRCh38.ht"
          },
          "gcp": {
            "eu": "gs://hail-datasets-eu/annotations/dbnsfp4.0a.GRCh38.ht",
            "us": "gs://hail-datasets-us/annotations/dbnsfp4.0a.GRCh38.ht"
          }
        },
        "version": "4.0"
      }
    ]

and GTEx:

[
      {
        "reference_genome": "GRCh37",
        "url": {
          "aws": {
            "us": "s3://hail-datasets-us-east-1/GTEx_RNA_seq_gene_TPMs.v7.GRCh37.mt"
          },
          "gcp": {
            "eu": "gs://hail-datasets-eu/GTEx_RNA_seq_gene_TPMs.v7.GRCh37.mt",
            "us": "gs://hail-datasets-us/GTEx_RNA_seq_gene_TPMs.v7.GRCh37.mt"
          }
        },
        "version": "v7"
      }
    ]

In general, this information is stored here:

hl.experimental.db.DB().config
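If you want to pick one URL out of a listing like the ones above, something along these lines might work. It is only a sketch: `find_url` is a hypothetical helper, and it takes the list of entries for a single dataset directly, because the exact nesting of that list inside the config is not shown in this thread.

```python
def find_url(versions, reference_genome, cloud="gcp", region="us"):
    """Return the URL for one reference genome, cloud and region, or None."""
    for entry in versions:
        if entry["reference_genome"] == reference_genome:
            # A cloud or region that is not listed yields None, not a KeyError
            return entry["url"].get(cloud, {}).get(region)
    return None


# Entries copied from the dbNSFP_variants listing in this thread.
DBNSFP_VERSIONS = [
    {
        "reference_genome": "GRCh37",
        "url": {
            "aws": {"us": "s3://hail-datasets-us-east-1/annotations/dbnsfp4.0a.GRCh37.ht"},
            "gcp": {
                "eu": "gs://hail-datasets-eu/annotations/dbnsfp4.0a.GRCh37.ht",
                "us": "gs://hail-datasets-us/annotations/dbnsfp4.0a.GRCh37.ht",
            },
        },
        "version": "4.0",
    },
    {
        "reference_genome": "GRCh38",
        "url": {
            "aws": {"us": "s3://hail-datasets-us-east-1/annotations/dbnsfp4.0a.GRCh38.ht"},
            "gcp": {
                "eu": "gs://hail-datasets-eu/annotations/dbnsfp4.0a.GRCh38.ht",
                "us": "gs://hail-datasets-us/annotations/dbnsfp4.0a.GRCh38.ht",
            },
        },
        "version": "4.0",
    },
]

if __name__ == "__main__":
    print(find_url(DBNSFP_VERSIONS, "GRCh38", cloud="gcp", region="eu"))
```

Picking the bucket in the region closest to you is probably the sensible default, since the download cost depends on where you are.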

Hi @danking,

Many thanks for the info. I’ll have a look.