Hi folks,
Is there any way to download local copies of some of the datasets used by Hail's load_dataset() function, for example GTEx_RNA_seq_gene_TPMs and dbNSFP_variants, in a Hail-friendly format for use on a local Spark cluster?
Regards!
You could download them using gsutil, but you’ll pay egress costs. This Google documentation explains how to download files from “requester pays” buckets. The per-gigabyte cost to download this data depends on where in the world you are.
Running gsutil du -sh gs://... estimates the size of a dataset; that operation has a small cost associated with it. dbNSFP is about 100 GB in size.
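As a rough sketch of the two steps above (size check, then download), the snippet below only builds the gsutil command lines and prints them; it does not run anything. The billing project name "my-billing-project" and the destination directory are hypothetical placeholders, and the use of gsutil's top-level -u option to name the project billed for a requester-pays bucket is my assumption about how you would set this up, so check it against the Google documentation before running.

```python
import shlex


def gsutil_commands(url, dest_dir, billing_project):
    """Build (size_check, download) gsutil commands as argument lists.

    billing_project is the project billed for requester-pays access
    (passed with gsutil's top-level -u option).
    """
    # Report the total size of everything under the URL, human readable.
    size_check = ["gsutil", "-u", billing_project, "du", "-sh", url]
    # Recursively copy the dataset directory, using parallel transfers (-m).
    download = ["gsutil", "-u", billing_project, "-m", "cp", "-r", url, dest_dir]
    return size_check, download


# Hypothetical values for illustration only.
size_check, download = gsutil_commands(
    "gs://hail-datasets-us/annotations/dbnsfp4.0a.GRCh37.ht",
    "/data/hail",
    "my-billing-project",
)
print(shlex.join(size_check))
print(shlex.join(download))
```

Once a .ht directory has been copied, it should be readable locally with hl.read_table, and a .mt directory with hl.read_matrix_table, pointing at the local path.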
These are the URLs for dbNSFP_variants:
[
  {
    "reference_genome": "GRCh37",
    "url": {
      "aws": {
        "us": "s3://hail-datasets-us-east-1/annotations/dbnsfp4.0a.GRCh37.ht"
      },
      "gcp": {
        "eu": "gs://hail-datasets-eu/annotations/dbnsfp4.0a.GRCh37.ht",
        "us": "gs://hail-datasets-us/annotations/dbnsfp4.0a.GRCh37.ht"
      }
    },
    "version": "4.0"
  },
  {
    "reference_genome": "GRCh38",
    "url": {
      "aws": {
        "us": "s3://hail-datasets-us-east-1/annotations/dbnsfp4.0a.GRCh38.ht"
      },
      "gcp": {
        "eu": "gs://hail-datasets-eu/annotations/dbnsfp4.0a.GRCh38.ht",
        "us": "gs://hail-datasets-us/annotations/dbnsfp4.0a.GRCh38.ht"
      }
    },
    "version": "4.0"
  }
]
and GTEx:
[
  {
    "reference_genome": "GRCh37",
    "url": {
      "aws": {
        "us": "s3://hail-datasets-us-east-1/GTEx_RNA_seq_gene_TPMs.v7.GRCh37.mt"
      },
      "gcp": {
        "eu": "gs://hail-datasets-eu/GTEx_RNA_seq_gene_TPMs.v7.GRCh37.mt",
        "us": "gs://hail-datasets-us/GTEx_RNA_seq_gene_TPMs.v7.GRCh37.mt"
      }
    },
    "version": "v7"
  }
]
In general, this information is stored here:
hl.experimental.db.DB().config
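If it helps, here is a small sketch of how one might pick a URL out of entries shaped like the lists above. It works on a plain Python list copied from this post rather than on the live Hail config, so the helper name find_url and the DBNSFP_ENTRIES constant are made up for illustration, and the real config may be structured differently.

```python
# Entries copied from the dbNSFP_variants listing in this thread.
DBNSFP_ENTRIES = [
    {
        "reference_genome": "GRCh37",
        "url": {
            "aws": {"us": "s3://hail-datasets-us-east-1/annotations/dbnsfp4.0a.GRCh37.ht"},
            "gcp": {
                "eu": "gs://hail-datasets-eu/annotations/dbnsfp4.0a.GRCh37.ht",
                "us": "gs://hail-datasets-us/annotations/dbnsfp4.0a.GRCh37.ht",
            },
        },
        "version": "4.0",
    },
    {
        "reference_genome": "GRCh38",
        "url": {
            "aws": {"us": "s3://hail-datasets-us-east-1/annotations/dbnsfp4.0a.GRCh38.ht"},
            "gcp": {
                "eu": "gs://hail-datasets-eu/annotations/dbnsfp4.0a.GRCh38.ht",
                "us": "gs://hail-datasets-us/annotations/dbnsfp4.0a.GRCh38.ht",
            },
        },
        "version": "4.0",
    },
]


def find_url(entries, reference_genome, cloud, region):
    """Return the URL for the first entry matching the reference genome.

    Raises KeyError if the genome, cloud, or region is not listed.
    """
    for entry in entries:
        if entry["reference_genome"] == reference_genome:
            return entry["url"][cloud][region]
    raise KeyError(f"no entry for reference genome {reference_genome!r}")


# Pick the copy closest to you to keep egress costs down.
print(find_url(DBNSFP_ENTRIES, "GRCh38", "gcp", "eu"))
```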
Hi @danking,
Many thanks for the info. I’ll have a look.