I'm encountering "Bucket is a requester pays bucket but no user project provided."

Previously, I could access requester pays buckets from my laptop and from Google Dataproc clusters. On my laptop, I had to install the Google Cloud Storage Hadoop connector. Unfortunately, this recently stopped working. How do I fix it?

Here’s an example of a failing script:

```
import hail as hl

# AUTO mode asks the connector to detect, per bucket, whether it is requester pays.
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'AUTO',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'broad-ctsa'
})
hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/results_full.mt').describe()
```

Hey! I’m sorry you’re having trouble with requester pays buckets.

Google Cloud Storage recently changed the error message it reports when a bucket is requester pays. Unfortunately, the only way for a third party to know whether a Google Cloud Storage bucket is requester pays is to inspect this error message. When the message changed, it broke several libraries that interact with requester pays buckets (e.g. Terra/GATK and GoogleCloudDataproc/hadoop-connectors).
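To make the fragility concrete, here is a minimal sketch (not Hail’s or the connector’s actual code) of the dance every client has to do, using the google-cloud-storage Python library. The function name is mine, and the string match on the error message is exactly the part that broke:

```
from google.api_core.exceptions import BadRequest
from google.cloud import storage

def list_blobs_requester_pays_aware(bucket_name, billing_project):
    client = storage.Client()
    try:
        # First attempt: no billing project attached.
        return list(client.list_blobs(bucket_name, max_results=5))
    except BadRequest as exc:
        # The only signal that the bucket is requester pays is the message text,
        # so a change in wording silently breaks this detection.
        if 'requester pays' not in str(exc).lower():
            raise
        # Retry with a billing (user) project attached.
        bucket = client.bucket(bucket_name, user_project=billing_project)
        return list(client.list_blobs(bucket, max_results=5))
```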

While we wait for Google to address the issue in the Hadoop connector (on which Hail relies), you must manually specify the requester pays buckets. In the example below, replace requester-pays-bucket1,requester-pays-bucket2 with a comma-separated list of the requester pays buckets you wish to use, and replace YOUR-PROJECT-ID-HERE with your Google Cloud project ID.

```
import hail as hl

# CUSTOM mode disables auto-detection: exactly the listed buckets are treated
# as requester pays and billed to the given project.
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'requester-pays-bucket1,requester-pays-bucket2',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'YOUR-PROJECT-ID-HERE'
})
hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/results_full.mt').describe()
```
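If you want to confirm the settings took effect, you can read them back from the SparkContext after hl.init; this is just a sanity check, and assumes you are on the Spark backend:

```
# Read the properties back from the SparkContext that hl.init configured.
conf = hl.spark_context().getConf()
print(conf.get('spark.hadoop.fs.gs.requester.pays.mode'))
print(conf.get('spark.hadoop.fs.gs.requester.pays.buckets'))
```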

Hi Dan, any updates on this?
I’m running into the same issue, despite providing a billing project with which I can run gsutil ls and see the gs://ukb-diverse-pops-public bucket.
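For reference, this is the sort of command that works for me (with my real project ID in place of YOUR-PROJECT-ID):

```
gsutil -u YOUR-PROJECT-ID ls gs://ukb-diverse-pops-public
```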

I’m using a Jupyter notebook on Terra and it’s throwing:

```
Error summary: GoogleJsonResponseException: 400 Bad Request
GET https://storage.googleapis.com/storage/v1/b/ukb-diverse-pops-public/o/sumstats_release%2Fresults_full.mt?fields=bucket,name,timeCreated,updated,generation,metageneration,size,contentType,contentEncoding,md5Hash,crc32c,metadata
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Bucket is a requester pays bucket but no user project provided.",
    "reason" : "required"
  } ],
  "message" : "Bucket is a requester pays bucket but no user project provided."
}
```

I've tried:

```
import hail as hl
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'requester-pays-bucket1,requester-pays-bucket2',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'YOUR-PROJECT-ID-HERE'
})
hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/results_full.mt').describe()
```

but it throws the same error.

Hey @jfu !

Google has released a fixed version of the connector; see GoogleCloudDataproc/hadoop-connectors issue #736, “Due to a changed error message, AUTO mode for requester pays buckets does not work”. The fix is in version 2.2.6, which, according to Google Dataproc, is available in cluster image versions 2.0.37 and later. Do you know which version you’re using? If not, you should contact Terra and request that they update to the latest cluster version.
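If you can run code on the cluster, one rough way to check is the connector jar’s filename, which usually embeds the version (e.g. gcs-connector-hadoop3-2.2.6.jar). The paths below are common locations on Dataproc images, not guarantees, so treat this as a sketch:

```
import glob

# Look for the GCS connector jar in typical Dataproc locations; the version
# is usually part of the filename. Paths may vary by image.
for pattern in ['/usr/lib/hadoop/lib/gcs-connector-*.jar',
                '/usr/local/share/google/dataproc/lib/gcs-connector*.jar']:
    print(glob.glob(pattern))
```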

Regarding the workaround, you need to adapt what I posted to your situation. In particular, instead of requester-pays-bucket1,requester-pays-bucket2 you should have ukb-diverse-pops-public, and instead of YOUR-PROJECT-ID-HERE you need to insert the Google Cloud project ID corresponding to your Terra project. You might need to contact Terra support to request the project ID if it isn’t shown to you in the Terra UI.
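Concretely, that looks like this (YOUR-PROJECT-ID-HERE still needs replacing with your actual project ID):

```
import hail as hl
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    # The requester pays bucket you want to read:
    'spark.hadoop.fs.gs.requester.pays.buckets': 'ukb-diverse-pops-public',
    # Replace with the Google Cloud project ID behind your Terra project:
    'spark.hadoop.fs.gs.requester.pays.project.id': 'YOUR-PROJECT-ID-HERE'
})
hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/results_full.mt').describe()
```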

EDIT: I’ve updated the solution to be a bit more clear about the need to make these changes to the code I include.


Thanks Dan! The default Dataproc version Terra uses must not be recent enough. Let me try spinning up my own 🙂

Hi @danking! The version on Terra is still stuck at 2.2.3 by default. I can confirm that spinning up my own Dataproc cluster, which has 2.2.6, resolves the error 🙂 Hopefully Terra catches up soon, so people don’t have to spin up their own Dataproc clusters and install dependencies.


I’m glad to hear you’re unblocked! May I ask what you mean by installing dependencies? Our intention is for `hailctl dataproc start` to automatically install everything you need to start a Dataproc cluster with Hail and a notebook.
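For example, the basic workflow looks like this (my-cluster is a placeholder name):

```
hailctl dataproc start my-cluster              # create a Hail-ready cluster
hailctl dataproc connect my-cluster notebook   # open a Jupyter notebook on it
hailctl dataproc stop my-cluster               # tear it down when done
```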

Oh whoops, didn’t realize there was that option. I manually configured a Dataproc cluster and installed Python, Hail, etc. Oops! Good to know for the future, thanks!
