Error: "Bucket is a requester pays bucket but no user project provided."

Hi! I am trying to annotate a matrix with CADD scores.

db = hl.experimental.DB(region='us', cloud='gcp')
mt = db.annotate_rows_db(mt, 'CADD')

I tried manually specifying the configuration as this post suggests, but the “Bucket is a requester pays bucket but no user project provided.” error persists.

import hail as hl
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'hail-datasets-us',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'caddprojectnew'
})

Any help would be greatly appreciated!

Hey @laf ,

What version of Hail is installed? Are you using Google Dataproc?

Yes, I’m using Dataproc, and the Hail version is 0.2.99-57537fea08d4. Thank you!

When starting a Dataproc cluster, you have to tell hailctl to enable requester pays:

  --requester-pays-allow-all
                        Allow reading from all requester-pays buckets.
  --requester-pays-allow-buckets REQUESTER_PAYS_ALLOW_BUCKETS
                        Comma-separated list of requester-pays buckets to
                        allow reading from.
  --requester-pays-allow-annotation-db
                        Allows reading from any of the requester-pays buckets
                        that hold data for the annotation database.

For Dataproc clusters, you can’t do this with hl.init because Spark is already running on the cluster. hl.init(spark_conf=...) only takes effect when running locally or when initializing a Spark cluster from scratch (i.e. not using Dataproc, EMR, or HDInsight).
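
To illustrate (a sketch, not code from this thread): on Dataproc, Hail attaches to the cluster’s already-running Spark context, so any spark_conf passed to hl.init arrives too late to change the JVM-side Hadoop settings.

import pyspark
import hail as hl

# On Dataproc a SparkContext for the cluster already exists; Hail
# attaches to it rather than creating a new one, so spark_conf passed
# to hl.init cannot alter the running JVM's Hadoop configuration.
sc = pyspark.SparkContext.getOrCreate()
hl.init(sc=sc)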

@danking Ok, thank you. I do think I did this when I originally started the cluster, like this:

hailctl dataproc start clutstercadd --requester-pays-allow-annotation-db --region='us-east1'

Can you share the full stack trace and error message you’re getting? If you have the Hail log file, that will help as well.

Here it is, thank you so much for the help!

FatalError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 y2 = db.annotate_rows_db(inty, 'CADD')

<decorator-gen> in annotate_rows_db(self, rel, *names)

~/lanefitz/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)
    578
    579 return wrapper

~/lanefitz/lib/python3.7/site-packages/hail/experimental/db.py in annotate_rows_db(self, rel, *names)
    534         rel = rel.annotate(**{dataset.name: genes.index(rel.key)[dataset.name]})
    535     else:
--> 536         indexed_value = dataset.index_compatible_version(rel.key)
    537         if isinstance(indexed_value.dtype, hl.tstruct) and len(indexed_value.dtype) == 0:
    538             indexed_value = hl.is_defined(indexed_value)

~/lanefitz/lib/python3.7/site-packages/hail/experimental/db.py in index_compatible_version(self, key_expr)
    266     compatible_indexed_values = [
    267         (version.maybe_index(key_expr, all_matches), version.version)
--> 268         for version in self.versions
    269         if version.maybe_index(key_expr, all_matches) is not None
    270     ]

~/lanefitz/lib/python3.7/site-packages/hail/experimental/db.py in <listcomp>(.0)
    267     (version.maybe_index(key_expr, all_matches), version.version)
    268     for version in self.versions
--> 269     if version.maybe_index(key_expr, all_matches) is not None
    270 ]
    271 if len(compatible_indexed_values) == 0:

~/lanefitz/lib/python3.7/site-packages/hail/experimental/db.py in maybe_index(self, indexer_key_expr, all_matches)
    153     Struct of compatible indexed values, if they exist.
    154     """
--> 155     return hl.read_table(self.url)._maybe_flexindex_table_by_expr(
    156         indexer_key_expr, all_matches=all_matches)
    157

<decorator-gen> in read_table(path, _intervals, _filter_intervals, _n_partitions, _assert_type, _load_refs)

~/lanefitz/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)
    578
    579 return wrapper

~/lanefitz/lib/python3.7/site-packages/hail/methods/impex.py in read_table(path, _intervals, _filter_intervals, _n_partitions, _assert_type, _load_refs)
   2923     """
   2924     if _load_refs:
-> 2925         for rg_config in Env.backend().load_references_from_dataset(path):
   2926             hl.ReferenceGenome._from_config(rg_config)
   2927

~/lanefitz/lib/python3.7/site-packages/hail/backend/spark_backend.py in load_references_from_dataset(self, path)
    326
    327 def load_references_from_dataset(self, path):
--> 328     return json.loads(self.hail_package().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
    329
    330 def from_fasta_file(self, name, fasta_file, index_file, x_contigs, y_contigs, mt_contigs, par):

~/lanefitz/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303 answer = self.gateway_client.send_command(command)
   1304 return_value = get_return_value(
-> 1305     answer, self.gateway_client, self.target_id, self.name)
   1306
   1307 for temp_arg in temp_args:

~/lanefitz/lib/python3.7/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29     tpl = Env.jutils().handleForPython(e.java_exception)
     30     deepest, full, error_id = tpl._1(), tpl._2(), tpl._3()
---> 31     raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
     32 except pyspark.sql.utils.CapturedException as e:
     33     raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: GoogleJsonResponseException: 400 Bad Request
GET https://storage.googleapis.com/storage/v1/b/hail-datasets-us/o/CADD%2Fv1.4%2FGRCh37%2Ftable.ht?fields=bucket,name,timeCreated,updated,generation,metageneration,size,contentType,contentEncoding,md5Hash,crc32c,metadata
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Bucket is a requester pays bucket but no user project provided.",
    "reason" : "required"
  } ],
  "message" : "Bucket is a requester pays bucket but no user project provided."
}

Java stack trace:
java.io.IOException: Error accessing gs://hail-datasets-us/CADD/v1.4/GRCh37/table.ht
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:2221)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:2108)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfoInternal(GoogleCloudStorageFileSystem.java:1091)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1065)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:955)
at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:166)
at is.hail.io.fs.FS.isDir(FS.scala:364)
at is.hail.io.fs.FS.isDir$(FS.scala:362)
at is.hail.io.fs.HadoopFS.isDir(HadoopFS.scala:72)
at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:31)
at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:74)
at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:581)
at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
at jdk.internal.reflect.GeneratedMethodAccessor88.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)

com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
GET https://storage.googleapis.com/storage/v1/b/hail-datasets-us/o/CADD%2Fv1.4%2FGRCh37%2Ftable.ht?fields=bucket,name,timeCreated,updated,generation,metageneration,size,contentType,contentEncoding,md5Hash,crc32c,metadata
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Bucket is a requester pays bucket but no user project provided.",
    "reason" : "required"
  } ],
  "message" : "Bucket is a requester pays bucket but no user project provided."
}
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:428)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:2215)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:2108)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfoInternal(GoogleCloudStorageFileSystem.java:1091)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1065)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:955)
at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:166)
at is.hail.io.fs.FS.isDir(FS.scala:364)
at is.hail.io.fs.FS.isDir$(FS.scala:362)
at is.hail.io.fs.HadoopFS.isDir(HadoopFS.scala:72)
at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:31)
at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:74)
at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:581)
at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
at jdk.internal.reflect.GeneratedMethodAccessor88.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
Hail version: 0.2.99-57537fea08d4
Error summary: GoogleJsonResponseException: 400 Bad Request
GET https://storage.googleapis.com/storage/v1/b/hail-datasets-us/o/CADD%2Fv1.4%2FGRCh37%2Ftable.ht?fields=bucket,name,timeCreated,updated,generation,metageneration,size,contentType,contentEncoding,md5Hash,crc32c,metadata
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Bucket is a requester pays bucket but no user project provided.",
    "reason" : "required"
  } ],
  "message" : "Bucket is a requester pays bucket but no user project provided."
}

https://o2portal.rc.hms.harvard.edu/node/compute-e-16-233.o2.rc.hms.harvard.edu/17606/edit/hail-20221006-1242-0.2.99-57537fea08d4.log

OK, and just to be totally clear, you started the cluster like this:

hailctl dataproc start clutstercadd --requester-pays-allow-annotation-db --region='us-east1'

And the Python code you’re executing is this:

import hail as hl
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'hail-datasets-us',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'caddprojectnew'
})

db = hl.experimental.DB(region='us', cloud='gcp')
mt = db.annotate_rows_db(mt, 'CADD')

Have you also tried without the spark_conf argument?
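
That is, something like this minimal version (a sketch; it relies only on the cluster-level requester-pays flag, with no Spark overrides):

import hail as hl

# With the cluster started via --requester-pays-allow-annotation-db,
# the requester-pays settings are already part of the cluster's
# Spark/Hadoop configuration, so no spark_conf overrides are needed.
hl.init()
db = hl.experimental.DB(region='us', cloud='gcp')
mt = db.annotate_rows_db(mt, 'CADD')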

Let’s confirm that your Spark configuration is correct. Can you navigate to the Spark UI page?

hailctl dataproc connect clutstercadd spark-ui

Then click “Environment”. It should show the Spark configuration values, including settings like “hadoop.fs.gs.requester.pays.buckets”. Can you confirm that those are set to the values you provided?
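
If connecting a browser is difficult, a rough equivalent from the Python side (a sketch; it assumes Hail has already attached to the cluster’s running Spark session) is to print the requester-pays properties straight from the live SparkConf:

import hail as hl

hl.init()  # on Dataproc this attaches to the cluster's existing Spark session

# Print every Spark property mentioning requester pays, to see what
# configuration the cluster was actually started with.
for key, value in hl.spark_context().getConf().getAll():
    if 'requester' in key:
        print(key, '=', value)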

Yes, that all looks correct. I’ve tried with and without the spark_conf. I ran that spark-ui command, but it needs Chromium, so I’m trying to get that now. Sorry, I’m new to this!

I’m working in a terminal, so I’m having a hard time connecting a browser. If there’s another way to check this, I’ll try that too!

Hi! This is what I get when I navigate to the Spark UI page by running the command you sent. Please let me know if that gives you any insight into what I am doing wrong. Thank you!