Hailctl dataproc submit fails with 'no user project provided'

I have been running a cluster ('hail-genebass') on Dataproc and have had success with a Jupyter notebook. However, I am having trouble translating that code into a script to be submitted to the cluster via hailctl dataproc submit.

The notebook code that works is:
import hail as hl
hl.init(spark_conf={
    'spark.hadoop.fs.gs.requester.pays.mode': 'CUSTOM',
    'spark.hadoop.fs.gs.requester.pays.buckets': 'ukbb-exome-public',
    'spark.hadoop.fs.gs.requester.pays.project.id': 'human-genetics-001'
})
mt = hl.read_matrix_table('gs://ukbb-exome-public/500k/results/variant_results.mt')

However, placing this code into the file test.py and submitting via
hailctl dataproc submit hail-genebass test.py

leads to the error:
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "Bucket is a requester pays bucket but no user project provided.",
"reason" : "required"
} ],
"message" : "Bucket is a requester pays bucket but no user project provided."

Given that I started (and can execute the notebook code on) the cluster with the project and requester-pays-allow-buckets identified, I’m not sure why this submit job should fail. Do I need to re-specify the project id and allowed buckets in the submit command and if so, how?


Possibly related to recent changes to the Hadoop GCS connector (see gcs/INSTALL.md at v3.0.0 in the GoogleCloudDataproc/hadoop-connectors repository on GitHub). What versions of Hail and Dataproc do you have?

Hi Dan. Hail version 0.2.128-eead8100a1c1; the Dataproc image looks like dataproc-2-1-deb11-20231128-155100-rc01.

@MattBrauer you probably need to specify those Spark configuration parameters as --properties because of the way Spark is initialized during submission. In particular, I think submission invokes pyspark, which starts a Spark session for you before your Python code (and therefore your hl.init call) is executed, so spark_conf settings passed to hl.init arrive too late to take effect.
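A sketch of what that might look like, assuming extra flags to hailctl dataproc submit are passed through to the underlying gcloud dataproc jobs submit pyspark command, whose --properties flag takes comma-separated key=value pairs (the bucket and project values here are the ones from the original post; depending on your hailctl version you may need a -- separator before the pass-through flags):

```shell
# Pass the requester-pays Spark properties at job submission time,
# so they are set before pyspark initializes the Spark session.
hailctl dataproc submit hail-genebass test.py \
    --properties='spark.hadoop.fs.gs.requester.pays.mode=CUSTOM,spark.hadoop.fs.gs.requester.pays.buckets=ukbb-exome-public,spark.hadoop.fs.gs.requester.pays.project.id=human-genetics-001'
```

With the properties supplied this way, the hl.init(spark_conf=...) block can be dropped from test.py, since the running Spark session already carries the requester-pays configuration.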