I was wondering if you could help us with the following error, which causes our Google Dataproc Hail cluster to crash:
ERROR: gcloud crashed (OperationalError): disk I/O error
After this error occurs we are unable to stop the cluster via the hailctl dataproc stop command, because every gcloud invocation fails with the same OperationalError.
The same script completed successfully when run on chr20, which is much smaller. Now that I am trying chr2, I get this disk I/O error. These are my cluster settings:
--image-version=1.4-debian9 \
--properties=spark:spark.driver.maxResultSize=0,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,hdfs:dfs.replication=1,dataproc:dataproc.logging.stackdriver.enable=false,dataproc:dataproc.monitoring.stackdriver.enable=false,spark:spark.driver.memory=41g \
--initialization-actions=gs://hail-common/hailctl/dataproc/0.2.20/init_notebook.py \
--metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.20/hail-0.2.20-py3-none-any.whl|||PKGS=aiohttp|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|hurry.filesize==0.9|ipykernel<5|nest_asyncio|numpy<2|pandas>0.22,<0.24|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3|PyYAML \
--master-machine-type=n1-highmem-8 \
--master-boot-disk-size=100GB \
--num-master-local-ssds=0 \
--num-preemptible-workers=0 \
--num-worker-local-ssds=0 \
--num-workers=128 \
--preemptible-worker-boot-disk-size=40GB \
--worker-boot-disk-size=40GB \
--worker-machine-type=n1-standard-8 \
--zone=europe-west2-a \
--initialization-action-timeout=20m \
--labels=creator=pa10_sanger_ac_uk
Starting cluster 'hail4676'...
Is there anything I can do to avoid this error? A larger local volume, perhaps (see the sketch below)? We would appreciate any help you can give us, as we really need to get this going ASAP and then scale it up to whole genomes.
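To illustrate what I mean by a larger local volume, something like the following change to the worker-disk flags above is what I have in mind (untested; the sizes are my guesses, not values from the docs):

--worker-boot-disk-size=100GB \
--num-worker-local-ssds=1 \

i.e. either a bigger boot disk per worker for Spark scratch/shuffle space, or a local SSD attached to each worker.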
Thanks,
Pavlos
Read chr2 mt
Split multialleles
checkpoint split matrixtable
[Stage 3:==============================================> (671 + 8) / 800]ERROR: gcloud crashed (OperationalError): disk I/O error
If you would like to report this issue, please run the following command:
gcloud feedback
To check gcloud for common problems, please run the following command:
gcloud info --run-diagnostics
Submitting to cluster 'hail6880'...
gcloud command:
gcloud dataproc jobs submit pyspark elgh_hail_analysis/kk8_code_google.py \
--cluster=hail6880 \
--files= \
--py-files=/tmp/pyscripts_5rn04sv4.zip \
--properties=spark.dynamicAllocation.enabled=false
Traceback (most recent call last):
File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/bin/hailctl", line 11, in <module>
sys.exit(main())
File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/__main__.py", line 91, in main
cli.main(args)
File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/dataproc/cli.py", line 106, in main
jmp[args.module].main(args, pass_through_args)
File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/dataproc/submit.py", line 72, in main
check_call(cmd)
File "/software/python-3.6.1/lib/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'elgh_hail_analysis/kk8_code_google.py', '--cluster=hail6880', '--files=', '--py-files=/tmp/pyscripts_5rn04sv4.zip', '--properties=spark.dynamicAllocation.enabled=false']' returned non-zero exit status 1.
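For context, the part of kk8_code_google.py that was running boils down to the three steps printed in the log above (a minimal sketch; the gs:// paths and names here are placeholders, not the real ones):

import hail as hl

hl.init()  # on Dataproc this picks up the cluster's Spark configuration

# "Read chr2 mt"
mt = hl.read_matrix_table('gs://my-bucket/chr2.mt')  # placeholder path

# "Split multialleles": split multiallelic sites into biallelic rows
mt = hl.split_multi_hts(mt)

# "checkpoint split matrixtable": writes the split table out and reads it
# back; this was the stage in flight ((671 + 8) / 800) when the error hit
mt = mt.checkpoint('gs://my-bucket/chr2.split.mt', overwrite=True)  # placeholder path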