Pass additional .py files with custom functions to hailctl dataproc start

I am creating a Hail Dataproc cluster with the following command:

hailctl dataproc start aargenti \
    --region us-central1 \
    --zone us-central1-b \
    --init gs://path/code/install_packages.sh \
    --master-machine-type n1-highmem-8 \
    --master-boot-disk-size 100 \
    --num-workers 4 \
    --worker-machine-type n1-highmem-8 \
    --worker-boot-disk-size 500 \
    --num-secondary-workers 50 \
    --secondary-worker-boot-disk-size 40 \
    --max-idle=10m

Once the cluster is running, I submit a .py file to run on the cluster (e.g., hailctl dataproc submit aargenti /path/python_script.py). However, I would also like to load additional .py files into the Python environment on the Dataproc cluster so that I can import custom functions into my main python_script.py file (by adding something like from custom_functions import * to the top of python_script.py). Could you please provide some guidance on how to achieve this? I tried listing the custom_functions.py file with the --py-files option when running hailctl dataproc start, but that doesn't seem to work. custom_functions.py is also saved on GCP.
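For illustration, custom_functions.py contains Hail helpers along these lines (the function and field names below are just hypothetical placeholders):

# custom_functions.py -- hypothetical example of the shared helpers
import hail as hl

def filter_to_autosomes(mt: hl.MatrixTable) -> hl.MatrixTable:
    # Keep only rows whose locus is on an autosome (assumes GRCh38-style 'chr' contig names).
    autosomes = hl.literal({f'chr{i}' for i in range(1, 23)})
    return mt.filter_rows(autosomes.contains(mt.locus.contig))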

Thanks.

The --pyfiles argument is passed to hailctl dataproc submit, not to start.

So the submit command would look like:

hailctl dataproc submit --pyfiles /path/custom_functions.py cluster /path/python_script.py
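The file listed with --pyfiles is shipped along with the job, so inside python_script.py it can then be imported as an ordinary module (use the module name, with no .py extension). A minimal sketch, assuming custom_functions.py defines a helper such as the filter_to_autosomes example above, and using a hypothetical dataset path:

# python_script.py -- sketch of using the shipped module
import hail as hl
from custom_functions import filter_to_autosomes  # module name, not custom_functions.py

hl.init()
mt = hl.read_matrix_table('gs://path/to/dataset.mt')  # hypothetical path
mt = filter_to_autosomes(mt)
print(mt.count())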

Thanks for such a quick reply - that worked perfectly!