Error in load_dataset()

Hi Hail team,

I’ve encountered an issue when using the following line to load a dataset in Hail:

mt = hl.experimental.load_dataset(name='dbSNP_rsid',
                                  reference_genome='GRCh38',
                                  version='154',
                                  region='us',
                                  cloud='gcp')

The error message is:
Hail version: 0.2.126-ee77707f4fab
Error summary: HailException: No file or directory found at gs://hail-datasets-us/dbSNP/build_154/GRCh38/full_table.ht

I can load other Hail datasets without any issue, such as the “gnomad_hgdp_1kg_subset_dense” data.

Could you please help with some suggestions regarding this issue? Thanks!

Best,
Wen

Hi @Wen_He, please see this post. Unfortunately due to Google pricing changes we had to move our datasets to regional buckets. Upgrading to 0.2.128 should fix the issue.

Thanks for the prompt response, @danielgoldstein Dan!

I’m currently working within the AOU workspace environment, which has restrictions that prevent me from upgrading Hail to version 0.2.128 on my end. Do you have any suggestions for alternative methods to access the dataset? Your input would be greatly appreciated. Many thanks!!

Hi @Wen_He, as a temporary measure, we can fix the URL that you’re trying to access and read it directly through hail’s normal methods. The only change that we made is moving the hail-datasets-us bucket to hail-datasets-us-central1. Can you try the following?

ht = hl.read_table('gs://hail-datasets-us-central1/dbSNP/build_154/GRCh38/full_table.ht')

I’ll add that since this is a regional bucket, it is imperative that you create your cluster in us-central1 to avoid expensive egress charges.

DISCLAIMER FOR FUTURE READERS: The google storage URIs above are not part of the officially supported hail API. Upgrading your hail version is still the recommended approach if possible.
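
For anyone who needs to apply the same bucket rename to more than one path, here is a minimal sketch of the rewrite. The helper name to_regional_uri is hypothetical and not part of Hail; it only swaps the old bucket prefix for the us-central1 one described above and leaves every other URI untouched.

```python
# Hypothetical helper, not part of the Hail API: rewrite a URI that points
# at the old multi-region bucket so it points at the us-central1 bucket.
OLD_BUCKET = "gs://hail-datasets-us/"
NEW_BUCKET = "gs://hail-datasets-us-central1/"


def to_regional_uri(uri: str) -> str:
    """Return uri with the old bucket prefix replaced, or unchanged otherwise."""
    # The trailing slash in OLD_BUCKET keeps already-rewritten URIs
    # (gs://hail-datasets-us-central1/...) from matching a second time.
    if uri.startswith(OLD_BUCKET):
        return NEW_BUCKET + uri[len(OLD_BUCKET):]
    return uri


old = "gs://hail-datasets-us/dbSNP/build_154/GRCh38/full_table.ht"
print(to_regional_uri(old))
# prints gs://hail-datasets-us-central1/dbSNP/build_154/GRCh38/full_table.ht
```

In a session where Hail is already initialized, the returned string can be passed to hl.read_table in the same way as the hard-coded path above. The same disclaimer applies: these URIs are not an officially supported interface.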

I’m experiencing the exact same issue, but I can’t seem to find a way to launch a dataproc cluster using the latest version of Hail as is suggested above. I have Hail v0.2.130 installed on my local machine that I use to start a dataproc cluster. But then inside that cluster Hail v0.2.120 is being installed. If I run pip install --upgrade hail either in a jupyter notebook or after I ssh into the main worker on the cluster, for some reason it is still installing Hail v0.2.130

Hail version: 0.2.120-f00f916faf78
Error summary: HailException: No file or directory found at gs://hail-datasets-us/dbSNP/build_154/GRCh37/full_table.ht

Hi @Austin_Argentieri, sorry you’re running into issues! This sounds like a Python environment problem. Could you provide some diagnostic information to help us figure out what’s going on? In particular:

  1. The exact command you use to start the dataproc cluster
  2. The output of each of the following commands:
       which pip
       pip show hail
       which python
       which python3
       python --version
       python3 --version
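
If it is easier to check from inside a notebook than from a shell, roughly the same information can be collected from Python itself. This is only a sketch using the standard library; the function name environment_report is hypothetical, and it reports on whichever interpreter runs the cell, which may differ from the one that the shell's pip points at.

```python
import sys
from importlib import metadata


def environment_report(package: str = "hail") -> dict:
    """Collect the interpreter path, Python version, and installed package version."""
    try:
        # Version of the package as seen by *this* interpreter.
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in this interpreter's environment.
        version = None
    return {
        "executable": sys.executable,        # counterpart of `which python`
        "python": sys.version.split()[0],    # counterpart of `python --version`
        package: version,                    # counterpart of `pip show hail`
    }


print(environment_report())
```

If the reported interpreter path or package version does not match what the shell commands show, the notebook and the shell are most likely using different Python environments.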

Hey @danielgoldstein. Thanks for your willingness to help out and for the helpful suggestion that it might be a Python problem. It looks like I had Python 3.8 running in my venv, and Hail v0.2.130 requires Python 3.9 or later. I’ve created a new venv with Python 3.10 and the problem is fixed!