I was wondering if you could help us with the following error, which causes our Google Dataproc Hail cluster to crash:
ERROR: gcloud crashed (OperationalError): disk I/O error
After this error occurs we are unable to stop the cluster via the hailctl dataproc stop command, because every gcloud invocation fails with the same OperationalError.
The same script completed successfully when run on chr20, which is much smaller. Now that I am trying chr2, I get this disk I/O error. These are my cluster settings:
--image-version=1.4-debian9 \
--properties=spark:spark.driver.maxResultSize=0,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,hdfs:dfs.replication=1,dataproc:dataproc.logging.stackdriver.enable=false,dataproc:dataproc.monitoring.stackdriver.enable=false,spark:spark.driver.memory=41g \
--initialization-actions=gs://hail-common/hailctl/dataproc/0.2.20/init_notebook.py \
--metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.20/hail-0.2.20-py3-none-any.whl|||PKGS=aiohttp|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|hurry.filesize==0.9|ipykernel<5|nest_asyncio|numpy<2|pandas>0.22,<0.24|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3|PyYAML \
--master-machine-type=n1-highmem-8 \
--master-boot-disk-size=100GB \
--num-master-local-ssds=0 \
--num-preemptible-workers=0 \
--num-worker-local-ssds=0 \
--num-workers=128 \
--preemptible-worker-boot-disk-size=40GB \
--worker-boot-disk-size=40GB \
--worker-machine-type=n1-standard-8 \
--zone=europe-west2-a \
--initialization-action-timeout=20m \
--labels=creator=pa10_sanger_ac_uk
Starting cluster 'hail4676'...
Is there anything I can do to avoid this error? A larger local volume, perhaps (see the sketch below)? We would appreciate any help you can give us, as we really need to get this going ASAP and then scale it up to whole genomes.
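To illustrate what I mean by a larger local volume, something like the following change to the worker-disk flags above is what I have in mind (untested; the sizes are my guesses, not values from the docs):

--worker-boot-disk-size=100GB \
--num-worker-local-ssds=1 \

i.e. either a bigger boot disk per worker for Spark scratch/shuffle space, or a local SSD attached to each worker.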
Thanks,
Pavlos
Read chr2 mt
Split multialleles
checkpoint split matrixtable
[Stage 3:==============================================> (671 + 8) / 800]ERROR: gcloud crashed (OperationalError): disk I/O error
If you would like to report this issue, please run the following command:
gcloud feedback
To check gcloud for common problems, please run the following command:
gcloud info --run-diagnostics
Submitting to cluster 'hail6880'...
gcloud command:
gcloud dataproc jobs submit pyspark elgh_hail_analysis/kk8_code_google.py \
--cluster=hail6880 \
--files= \
--py-files=/tmp/pyscripts_5rn04sv4.zip \
--properties=spark.dynamicAllocation.enabled=false
Traceback (most recent call last):
File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/bin/hailctl", line 11, in <module>
sys.exit(main())
File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/__main__.py", line 91, in main
cli.main(args)
File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/dataproc/cli.py", line 106, in main
jmp[args.module].main(args, pass_through_args)
File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/dataproc/submit.py", line 72, in main
check_call(cmd)
File "/software/python-3.6.1/lib/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'elgh_hail_analysis/kk8_code_google.py', '--cluster=hail6880', '--files=', '--py-files=/tmp/pyscripts_5rn04sv4.zip', '--properties=spark.dynamicAllocation.enabled=false']' returned non-zero exit status 1.
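For context, the part of kk8_code_google.py that was running boils down to the three steps printed in the log above (a minimal sketch; the gs:// paths and names here are placeholders, not the real ones):

import hail as hl

hl.init()  # on Dataproc this picks up the cluster's Spark configuration

# "Read chr2 mt"
mt = hl.read_matrix_table('gs://my-bucket/chr2.mt')  # placeholder path

# "Split multialleles": split multiallelic sites into biallelic rows
mt = hl.split_multi_hts(mt)

# "checkpoint split matrixtable": writes the split table out and reads it
# back; this was the stage in flight ((671 + 8) / 800) when the error hit
mt = mt.checkpoint('gs://my-bucket/chr2.split.mt', overwrite=True)  # placeholder path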