I have been running a cluster (‘hail-genebass’) on Dataproc and have had success with a Jupyter notebook. However, I am having trouble translating that code into a script to be submitted to the cluster via `hailctl dataproc submit`.
The notebook code that works is:
```python
import hail as hl

hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'ukbb-exome-public',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'human-genetics-001'
})

mt = hl.read_matrix_table('gs://ukbb-exome-public/500k/results/variant_results.mt')
```
However, placing this code into the file `test.py` and submitting it via

```
hailctl dataproc submit hail-genebass test.py
```

leads to the error:
```json
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Bucket is a requester pays bucket but no user project provided.",
    "reason" : "required"
  } ],
  "message" : "Bucket is a requester pays bucket but no user project provided."
}
```
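For context, I started the cluster with a command along these lines (reconstructed from memory, so the exact invocation may have differed; the project and bucket names match the notebook config above):

```
hailctl dataproc start hail-genebass \
    --project human-genetics-001 \
    --requester-pays-allow-buckets ukbb-exome-public
```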
Given that I started the cluster with the project and allowed requester-pays buckets specified (and can execute the notebook code on it), I’m not sure why this submit job should fail. Do I need to re-specify the project ID and allowed buckets in the submit command, and if so, how?
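My best guess is something like the sketch below, re-passing the same settings as Spark properties. I’m assuming here that `hailctl dataproc submit` forwards a `--properties` flag to `gcloud dataproc jobs submit pyspark` (which does accept one), but I haven’t been able to confirm that:

```
hailctl dataproc submit hail-genebass test.py \
    --properties='spark.hadoop.fs.gs.requester.pays.mode=CUSTOM,spark.hadoop.fs.gs.requester.pays.buckets=ukbb-exome-public,spark.hadoop.fs.gs.requester.pays.project.id=human-genetics-001'
```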
Thanks.