ERROR: gcloud crashed (OperationalError): disk I/O error

I was wondering if you could help us with the following error, which causes our Google Hail cluster to crash:

ERROR: gcloud crashed (OperationalError): disk I/O error

After this error occurs we are unable to stop the cluster via the hailctl dataproc stop command, because every process gives the same OperationalError.

The script I am running completed successfully before when run for chr20, which is smaller.

Now that I am trying chr2 I get this disk I/O error. These are my cluster settings:

    --image-version=1.4-debian9 \
    --properties=spark:spark.driver.maxResultSize=0,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,hdfs:dfs.replication=1,dataproc:dataproc.logging.stackdriver.enable=false,dataproc:dataproc.monitoring.stackdriver.enable=false,spark:spark.driver.memory=41g \
    --initialization-actions=gs://hail-common/hailctl/dataproc/0.2.20/init_notebook.py \
    --metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.20/hail-0.2.20-py3-none-any.whl|||PKGS=aiohttp|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|hurry.filesize==0.9|ipykernel<5|nest_asyncio|numpy<2|pandas>0.22,<0.24|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3|PyYAML \
    --master-machine-type=n1-highmem-8 \
    --master-boot-disk-size=100GB \
    --num-master-local-ssds=0 \
    --num-preemptible-workers=0 \
    --num-worker-local-ssds=0 \
    --num-workers=128 \
    --preemptible-worker-boot-disk-size=40GB \
    --worker-boot-disk-size=40 \
    --worker-machine-type=n1-standard-8 \
    --zone=europe-west2-a \
    --initialization-action-timeout=20m \
    --labels=creator=pa10_sanger_ac_uk
Starting cluster 'hail4676'...

Is there anything I can do to avoid this error? A larger local volume perhaps? We would appreciate any help you can give us, as we really need to get this going ASAP and scale it up to whole genomes.

Thanks,

Pavlos

Read chr2 mt 

Split multialleles

checkpoint split matrixtable

[Stage 3:==============================================>  (671 + 8) / 800]ERROR: gcloud crashed (OperationalError): disk I/O error

 

If you would like to report this issue, please run the following command:

  gcloud feedback

 

To check gcloud for common problems, please run the following command:

  gcloud info --run-diagnostics

Submitting to cluster 'hail6880'...

gcloud command:
gcloud dataproc jobs submit pyspark elgh_hail_analysis/kk8_code_google.py \
    --cluster=hail6880 \
    --files= \
    --py-files=/tmp/pyscripts_5rn04sv4.zip \
    --properties=spark.dynamicAllocation.enabled=false

Traceback (most recent call last):
  File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/bin/hailctl", line 11, in <module>
    sys.exit(main())
  File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/__main__.py", line 91, in main
    cli.main(args)
  File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/dataproc/cli.py", line 106, in main
    jmp[args.module].main(args, pass_through_args)
  File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/dataproc/submit.py", line 72, in main
    check_call(cmd)
  File "/software/python-3.6.1/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'elgh_hail_analysis/kk8_code_google.py', '--cluster=hail6880', '--files=', '--py-files=/tmp/pyscripts_5rn04sv4.zip', '--properties=spark.dynamicAllocation.enabled=false']' returned non-zero exit status 1.
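
For reference, the steps logged above ("Read chr2 mt", "Split multialleles", "checkpoint split matrixtable") correspond roughly to a Hail pipeline like the sketch below. The bucket paths and exact calls are assumptions for illustration, not the actual kk8_code_google.py.

    import hail as hl

    hl.init()  # on Dataproc this picks up the cluster's Spark configuration

    # Hypothetical GCS paths; the real script uses its own bucket layout.
    mt = hl.read_matrix_table('gs://my-bucket/chr2.mt')    # "Read chr2 mt"
    mt = hl.split_multi_hts(mt)                            # "Split multialleles"
    mt = mt.checkpoint('gs://my-bucket/chr2.split.mt',     # "checkpoint split matrixtable"
                       overwrite=True)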

Bizarre. This sounds like an issue on your laptop, though, not in the cluster. What happened to the Dataproc cluster? Is it still present in the GCP console?

Yes, I cannot stop it; even this command gives me the same error:

hailctl dataproc stop hail6880
Stopping cluster 'hail6880'...
WARNING: Dataproc --region flag will become required in January 2020. Please either specify this flag, or set default by running 'gcloud config set dataproc/region <your-default-region>'
ERROR: gcloud crashed (OperationalError): disk I/O error


If you would like to report this issue, please run the following command:
  gcloud feedback

To check gcloud for common problems, please run the following command:
  gcloud info --run-diagnostics
Traceback (most recent call last):
  File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/bin/hailctl", line 11, in <module>
    sys.exit(main())
  File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/__main__.py", line 91, in main
    cli.main(args)
  File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/dataproc/cli.py", line 106, in main
    jmp[args.module].main(args, pass_through_args)
  File "/lustre/scratch119/realdata/mdt3/teams/hgi/hail/interval_wgs_upload/google/.venv/lib/python3.6/site-packages/hailtop/hailctl/dataproc/stop.py", line 15, in main
    check_call(cmd)
  File "/software/python-3.6.1/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'clusters', 'delete', '--quiet', 'hail6880']' returned non-zero exit status 1.

It looks like you need to get in touch with Google about this. There's some bad interaction between your environment and gcloud. I'd start with gcloud info --run-diagnostics; that might give you a clue. I suspect you either have no available disk space or lack permissions on a directory gcloud wants to read/write (e.g. your home directory, the gcloud install directory, …).
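
To make that concrete, here is a minimal sketch of the kind of check I mean, assuming gcloud's default config location (the OperationalError gcloud reports is a sqlite3 error, and gcloud keeps its sqlite databases under its config directory; the exact filename below is a guess):

    import os
    import shutil
    import sqlite3

    # gcloud's config directory; CLOUDSDK_CONFIG overrides the default.
    config_dir = os.environ.get('CLOUDSDK_CONFIG',
                                os.path.expanduser('~/.config/gcloud'))

    # Is there free space, and can we write where gcloud writes?
    total, used, free = shutil.disk_usage(config_dir)
    print('free space under %s: %.1f GB' % (config_dir, free / 1e9))
    print('config dir writable:', os.access(config_dir, os.W_OK))

    # Try opening one of gcloud's sqlite databases (filename varies by SDK version).
    db = os.path.join(config_dir, 'credentials.db')
    if os.path.exists(db):
        try:
            sqlite3.connect(db).execute('PRAGMA quick_check').fetchone()
            print('credentials.db opens and checks cleanly')
        except sqlite3.OperationalError as exc:
            print('sqlite I/O problem:', exc)

If any of those checks fail, the problem is in the local environment (quota or permissions), not in the cluster.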

Thanks, Dan. I have run the diagnostics command and it passes all the tests with no issues. I have contacted Google and am waiting to hear back.