I am creating a Hail Dataproc cluster using the following command:
hailctl dataproc start aargenti \
--region us-central1 \
--zone us-central1-b \
--init gs://path/code/install_packages.sh \
--master-machine-type n1-highmem-8 \
--master-boot-disk-size 100 \
--num-workers 4 \
--worker-machine-type n1-highmem-8 \
--worker-boot-disk-size 500 \
--num-secondary-workers 50 \
--secondary-worker-boot-disk-size 40 \
--max-idle=10m
Once the cluster is running, I submit a .py file to run on it, e.g.:

hailctl dataproc submit aargenti /path/python_script.py

However, I would also like to load additional .py files into the Python environment on the Dataproc cluster, so that I can import custom functions into my main python_script.py file by adding something like

from custom_functions import *

to the top of python_script.py. Could you please provide some guidance on how to achieve this? I tried listing the custom_functions.py file with the --py-files option when running hailctl dataproc start, but this doesn't seem to work. custom_functions.py is also stored on GCP.
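For concreteness, here is a minimal sketch of the two files involved. The helper function below is a made-up placeholder just to illustrate the kind of utility I want to share; my real custom_functions.py contains several such Hail functions:

# custom_functions.py
import hail as hl

def count_rows_and_cols(mt):
    # trivial example helper operating on a Hail MatrixTable
    return mt.count_rows(), mt.count_cols()

# python_script.py (the script passed to hailctl dataproc submit)
import hail as hl
from custom_functions import *

hl.init()
mt = hl.balding_nichols_model(n_populations=3, n_samples=10, n_variants=100)
print(count_rows_and_cols(mt))

This works fine locally when both files sit in the same directory; the problem is getting custom_functions.py onto the Python path of the Dataproc cluster so the import succeeds there.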
Thanks.