Hi, I’m running a script that dies after giving the warning “maximal_independent_set may run out of memory”. Do you have any advice on this?
2022-08-02 16:21:36 Hail: INFO: ld_prune: running local pruning stage with max queue size of 541201 variants
2022-08-02 16:24:01 Hail: INFO: wrote table with 3356459 rows in 115376 partitions to /tmp/M9ZpN5CoeFopVRDnpKVv5V
Total size: 95.67 MiB
* Rows: 95.67 MiB
* Globals: 11.00 B
* Smallest partition: 0 rows (21.00 B)
* Largest partition: 77 rows (2.33 KiB)
2022-08-02 16:28:36 Hail: INFO: Wrote all 820 blocks of 3356459 x 294 matrix with block size 4096.
2022-08-02 16:35:52 Hail: INFO: wrote table with 48565726 rows in 1639 partitions to /tmp/tfHThUiCD9m00Bz2BKBVyt
Total size: 533.95 MiB
* Rows: 500.28 MiB
* Globals: 33.66 MiB
* Smallest partition: 419 rows (3.91 KiB)
* Largest partition: 78125 rows (840.47 KiB)
2022-08-02 16:39:12 Hail: WARN: over 400,000 edges are in the graph; maximal_independent_set may run out of memory
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [db790e4009254d9893a4dc042d62d717] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/db790e4009254d9893a4dc042d62d717?project=broad-mpg-gnomad&region=us-central1
gcloud dataproc jobs wait 'db790e4009254d9893a4dc042d62d717' --region 'us-central1' --project 'broad-mpg-gnomad'
https://console.cloud.google.com/storage/browser/dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/c9402aa7-e519-41e2-b3e0-7c91dc3d72dc/jobs/db790e4009254d9893a4dc042d62d717/
gs://dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/c9402aa7-e519-41e2-b3e0-7c91dc3d72dc/jobs/db790e4009254d9893a4dc042d62d717/driveroutput
Traceback (most recent call last):
File "/Users/kristen/miniconda3/envs/hail/bin/hailctl", line 8, in <module>
sys.exit(main())
File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 107, in main
cli.main(args)
File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 124, in main
jmp[args.module].main(args, pass_through_args))
File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
return future.result()
File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 88, in main
gcloud.run(cmd)
File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
return subprocess.check_call(["gcloud"] + command)
File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
Script:
Command:
hailctl dataproc submit kml subpop_analysis.py
--run-filter-subpop-qc
--run-subpop-pca
--high-quality
--pop mid
--overwrite
--max-proportion-mislabeled-training-samples .90
Cluster:
hailctl dataproc start kml --worker-machine-type n1-highmem-16 --num-workers 30 --init gs://gnomad-kristen/mitochondria/broad_master-init.sh --requester-pays-allow-buckets gnomad --project broad-mpg-gnomad --max-idle=30m