I'm encountering "Bucket is a requester pays bucket but no user project provided."

Previously, I could access requester pays buckets from both my laptop and Google Dataproc clusters. On my laptop, I had to install the Google Cloud Storage Hadoop connector. Unfortunately, this recently stopped working. How do I fix it?

Here’s an example of a failing script.

import hail as hl
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'AUTO',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'broad-ctsa'
})
hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/results_full.mt').describe()

READ THIS FIRST

This problem has been resolved. If you are getting an error about requester pays today, the cluster is almost certainly not configured for requester pays. Note that spark_conf settings passed to hl.init only take effect if a Spark cluster is not already running, for example, when running Hail locally.

If you’re using Google Dataproc, you should specify one of these parameters to configure your cluster for requester pays usage (the following information is available from hailctl dataproc start --help; a usage sketch follows the list):

  --requester-pays-allow-all
                        Allow reading from all requester-pays buckets.
  --requester-pays-allow-buckets REQUESTER_PAYS_ALLOW_BUCKETS
                        Comma-separated list of requester-pays buckets to
                        allow reading from.
  --requester-pays-allow-annotation-db
                        Allows reading from any of the requester-pays buckets
                        that hold data for the annotation database.
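For example, here is a minimal sketch of a cluster start command that allows reading the bucket discussed in this thread (the cluster name my-cluster is a placeholder; the flag comes from the help text above):

hailctl dataproc start my-cluster \
    --requester-pays-allow-buckets ukb-diverse-pops-public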



Hey! I’m sorry you’re having trouble with requester pays buckets.

Google Cloud Storage recently changed the error message reported when a bucket is requester pays. Unfortunately, the only way for a third party to know whether a Google Cloud Storage bucket is requester pays is to inspect this error message. When the message changed, it broke several libraries that interact with requester pays buckets (e.g. Terra/GATK and GoogleCloudDataproc/hadoop-connectors).
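To make the failure mode concrete, here is a rough sketch of the kind of message-matching heuristic such clients rely on (illustrative only; the real detection logic lives inside each library):

def looks_like_requester_pays(error_message: str) -> bool:
    # Clients key off the exact error text, so when Google rewords it,
    # requester pays detection silently breaks.
    return 'requester pays bucket but no user project provided' in error_message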

While we wait for Google to fix the issue in the Hadoop connector (on which Hail relies), you must specify the buckets manually. In the example below, replace requester-pays-bucket1,requester-pays-bucket2 with a comma-separated list of the requester pays buckets you wish to use, and replace YOUR-PROJECT-ID-HERE with your Google Cloud project’s ID.

import hail as hl
# CUSTOM mode treats only the buckets listed below as requester pays.
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    # Bucket names only; do not include the gs:// prefix.
    'spark.hadoop.fs.gs.requester.pays.buckets': 'requester-pays-bucket1,requester-pays-bucket2',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'YOUR-PROJECT-ID-HERE'
})
hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/results_full.mt').describe()

Hi Dan, any updates on this?
I’m running into the same issue even though I’m supplying a billing project with which gsutil ls can see the gs://ukb-diverse-pops-public directory.

I’m using a Jupyter notebook on Terra and it’s throwing

Error summary: GoogleJsonResponseException: 400 Bad Request
GET https://storage.googleapis.com/storage/v1/b/ukb-diverse-pops-public/o/sumstats_release%2Fresults_full.mt?fields=bucket,name,timeCreated,updated,generation,metageneration,size,contentType,contentEncoding,md5Hash,crc32c,metadata
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Bucket is a requester pays bucket but no user project provided.",
    "reason" : "required"
  } ],
  "message" : "Bucket is a requester pays bucket but no user project provided."
}

I've tried

import hail as hl
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'requester-pays-bucket1,requester-pays-bucket2',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'YOUR-PROJECT-ID-HERE'
})
hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/results_full.mt').describe()

but it throws the same error

Hey @jfu!

Google has released a new version of the connector: see “Due to a changed error message, AUTO mode for requester pays buckets does not work” (Issue #736, GoogleCloudDataproc/hadoop-connectors on GitHub). The fix is in connector version 2.2.6, which, according to Google Dataproc, is available in cluster image versions 2.0.37 and later. Do you know which version you’re using? If not, you should contact Terra and request that they update to the latest cluster version.
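If you manage the cluster yourself, one way to check the image version is with gcloud (my-cluster and us-central1 are placeholders for your cluster name and region):

gcloud dataproc clusters describe my-cluster \
    --region us-central1 \
    --format='value(config.softwareConfig.imageVersion)'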

Regarding the workaround, you need to change what I posted to fit your needs. In particular, instead of requester-pays-bucket1,requester-pays-bucket2 you should have ukb-diverse-pops-public, and instead of YOUR-PROJECT-ID-HERE you need to insert the Google Cloud project ID corresponding to your Terra project. You might need to contact Terra support to request the project ID if it isn’t shown in the Terra UI.
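Concretely, the substituted configuration would look something like this (my-terra-project-id is a placeholder for your actual project ID):

import hail as hl
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    # Bucket name only, without the gs:// prefix.
    'spark.hadoop.fs.gs.requester.pays.buckets': 'ukb-diverse-pops-public',
    # Placeholder: use the Google Cloud project ID behind your Terra project.
    'spark.hadoop.fs.gs.requester.pays.project.id': 'my-terra-project-id'
})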

EDIT: I’ve updated the solution above to make it clearer that these values must be changed in the code I include.


Thanks Dan! The default Dataproc image Terra uses must not be the right version; let me try spinning up my own.

Hi @danking! The default version on Terra is still stuck at 2.2.3. I can confirm that spinning up my own Dataproc cluster, which ships connector 2.2.6, resolves the error. Hopefully Terra catches up soon, so people can avoid spinning up their own Dataproc cluster and installing dependencies.


I’m glad to hear you’re unblocked! May I ask what you mean by installing dependencies? Our intention is for hailctl dataproc start to automatically install everything you need to start a Dataproc cluster with Hail and a notebook.

Oh whoops, I didn’t realize that option existed. I manually configured a Dataproc cluster and installed Python, Hail, etc. Good to know for the future, thanks!


Are these prescriptions still valid for Hail on Terra? I query the environment to learn the project ID and use

hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'gs://ukb-diverse-pops-public',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'terra-c9c997fd'
})

to initialize.

Then


>>> hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/results_full.mt').describe()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<decorator-gen-1398>", line 2, in read_matrix_table
  File "/opt/conda/lib/python3.7/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/lib/python3.7/site-packages/hail/methods/impex.py", line 2469, in read_matrix_table
    for rg_config in Env.backend().load_references_from_dataset(path):
  File "/opt/conda/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 337, in load_references_from_dataset
    return json.loads(self.hail_package().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
  File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1322, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/conda/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 31, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: GoogleJsonResponseException: 400 Bad Request
GET https://storage.googleapis.com/storage/v1/b/ukb-diverse-pops-public/o/sumstats_release%2Fresults_full.mt?fields=bucket,name,timeCreated,updated,generation,metageneration,size,contentType,contentEncoding,md5Hash,crc32c,metadata
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Bucket is a requester pays bucket but no user project provided.",
    "reason" : "required"
  } ],
  "message" : "Bucket is a requester pays bucket but no user project provided."
}

I verified that I can list the files with

gsutil -u landmarkanvil2 ls gs://ukb-diverse-pops-public/sumstats_release
gs://ukb-diverse-pops-public/sumstats_release/
gs://ukb-diverse-pops-public/sumstats_release/full_variant_qc_metrics.txt.bgz
gs://ukb-diverse-pops-public/sumstats_release/full_variant_qc_metrics.txt.bgz.tbi
gs://ukb-diverse-pops-public/sumstats_release/h2_manifest.tsv.bgz
gs://ukb-diverse-pops-public/sumstats_release/phenotype_manifest.tsv.bgz
gs://ukb-diverse-pops-public/sumstats_release/meta_analysis.h2_qc.mt/
gs://ukb-diverse-pops-public/sumstats_release/meta_analysis.mt/
gs://ukb-diverse-pops-public/sumstats_release/meta_analysis.raw.mt/
gs://ukb-diverse-pops-public/sumstats_release/results_full.mt.bak/
gs://ukb-diverse-pops-public/sumstats_release/results_full.mt/

as well as with Terra’s GOOGLE_PROJECT value, but within Hail I cannot access the bucket.
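For reference, I read that value from the environment roughly like this:

import os
# Terra notebook environments expose the billing project as GOOGLE_PROJECT.
project_id = os.environ['GOOGLE_PROJECT']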

Don’t include the gs:// prefix in the bucket list; fs.gs.requester.pays.buckets expects bare bucket names.
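That is, something along these lines (values taken from your post, with the prefix dropped):

import hail as hl
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    # Bare bucket name, no gs:// prefix.
    'spark.hadoop.fs.gs.requester.pays.buckets': 'ukb-diverse-pops-public',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'terra-c9c997fd'
})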

Yes! Thank you!