Previously, I could access requester pays buckets from both my laptop and Google Dataproc clusters. On my laptop, I had to install the Google Cloud Platform Hadoop connector. Unfortunately, this recently stopped working. How do I fix it?
Hey! I’m sorry you’re having trouble with requester pays buckets.
Google Cloud Storage recently changed the error message reported when a bucket is requester pays. Unfortunately, the only way for a third party to know whether a Google Cloud Storage bucket is requester pays is to inspect this error message. When the message changed, it broke several libraries for interacting with requester pays buckets (e.g. Terra/GATK, GoogleCloudPlatform/hadoop-connector).
While we wait for Google to address the issue in the Hadoop connector (on which Hail relies), you must manually specify the requester pays buckets. In the example below, replace requester-pays-bucket1,requester-pays-bucket2 with a comma-separated list of the requester pays buckets you wish to use, and replace YOUR-PROJECT-ID-HERE with your Google Cloud project’s ID.
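```
import hail as hl

# CUSTOM mode treats only the buckets listed below as requester pays;
# requests to them are billed to the given project.
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'requester-pays-bucket1,requester-pays-bucket2',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'YOUR-PROJECT-ID-HERE'
})
```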
Hi Dan, any updates on this?
I’m running into the same issue, despite supplying a billing account with which I can run gsutil ls and see the gs://ukb-diverse-pops-public directory.
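That check was along these lines (MY-BILLING-PROJECT is a placeholder for my project; gsutil’s -u flag sets the project to bill):

```
gsutil -u MY-BILLING-PROJECT ls gs://ukb-diverse-pops-public
```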
I’m using a Jupyter notebook on Terra and it’s throwing:

```
Error summary: GoogleJsonResponseException: 400 Bad Request
GET https://storage.googleapis.com/storage/v1/b/ukb-diverse-pops-public/o/sumstats_release%2Fresults_full.mt?fields=bucket,name,timeCreated,updated,generation,metageneration,size,contentType,contentEncoding,md5Hash,crc32c,metadata
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Bucket is a requester pays bucket but no user project provided.",
    "reason" : "required"
  } ],
  "message" : "Bucket is a requester pays bucket but no user project provided."
}
```
I’ve tried:

```
import hail as hl

hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'requester-pays-bucket1,requester-pays-bucket2',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'YOUR-PROJECT-ID-HERE'
})

hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/results_full.mt').describe()
```
Regarding the workaround, you need to change what I posted to fit your needs. In particular, instead of requester-pays-bucket1,requester-pays-bucket2 you should have ukb-diverse-pops-public, and instead of YOUR-PROJECT-ID-HERE you need to insert the Google Cloud project ID corresponding to your Terra project. You might need to contact Terra support to get the project ID if it isn’t shown in the Terra UI.
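Concretely, it should look something like this (YOUR-TERRA-PROJECT-ID is still a placeholder you must fill in with your actual project ID):

```
import hail as hl

hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'ukb-diverse-pops-public',
    # Replace with the Google Cloud project ID behind your Terra billing project.
    'spark.hadoop.fs.gs.requester.pays.project.id': 'YOUR-TERRA-PROJECT-ID'
})

hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/results_full.mt').describe()
```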
EDIT: I’ve updated the solution to be a bit more clear about the need to make these changes to the code I include.
Hi @danking! The version on Terra is still stuck at 2.2.3 by default. I can confirm that spinning up my own Dataproc cluster, which contains 2.2.6, resolves the error. Hopefully Terra catches up soon, so we can avoid spinning up a Dataproc cluster and installing dependencies.
I’m glad to hear you’re unblocked! May I ask what you mean by installing dependencies? Our intention is for hailctl dataproc start to automatically install everything you need to start a Dataproc cluster with Hail and a notebook.
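For example, something like the following (my-cluster is a placeholder cluster name; the --requester-pays-allow-buckets flag applies the same requester pays configuration discussed above):

```
# Start a Dataproc cluster with Hail, Spark, and Jupyter preinstalled.
hailctl dataproc start my-cluster --requester-pays-allow-buckets ukb-diverse-pops-public

# Open a Jupyter notebook running on that cluster.
hailctl dataproc connect my-cluster notebook
```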
Oh whoops - didn’t realize there was that option. I manually configured a Dataproc cluster and installed Python, Hail, etc. Oops! Good to know for the future, thanks!