Maximal_independent_set may run out of memory warning

Hi, I’m running a script that dies after giving the warning “maximal_independent_set may run out of memory” - do you have any advice on this?

2022-08-02 16:21:36 Hail: INFO: ld_prune: running local pruning stage with max queue size of 541201 variants
2022-08-02 16:24:01 Hail: INFO: wrote table with 3356459 rows in 115376 partitions to /tmp/M9ZpN5CoeFopVRDnpKVv5V
    Total size: 95.67 MiB
    * Rows: 95.67 MiB
    * Globals: 11.00 B
    * Smallest partition: 0 rows (21.00 B)
    * Largest partition:  77 rows (2.33 KiB)
2022-08-02 16:28:36 Hail: INFO: Wrote all 820 blocks of 3356459 x 294 matrix with block size 4096.
2022-08-02 16:35:52 Hail: INFO: wrote table with 48565726 rows in 1639 partitions to /tmp/tfHThUiCD9m00Bz2BKBVyt
    Total size: 533.95 MiB
    * Rows: 500.28 MiB
    * Globals: 33.66 MiB
    * Smallest partition: 419 rows (3.91 KiB)
    * Largest partition:  78125 rows (840.47 KiB)
2022-08-02 16:39:12 Hail: WARN: over 400,000 edges are in the graph; maximal_independent_set may run out of memory
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [db790e4009254d9893a4dc042d62d717] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/db790e4009254d9893a4dc042d62d717?project=broad-mpg-gnomad&region=us-central1
gcloud dataproc jobs wait 'db790e4009254d9893a4dc042d62d717' --region 'us-central1' --project 'broad-mpg-gnomad'
https://console.cloud.google.com/storage/browser/dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/c9402aa7-e519-41e2-b3e0-7c91dc3d72dc/jobs/db790e4009254d9893a4dc042d62d717/
gs://dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/c9402aa7-e519-41e2-b3e0-7c91dc3d72dc/jobs/db790e4009254d9893a4dc042d62d717/driveroutput
Traceback (most recent call last):
  File "/Users/kristen/miniconda3/envs/hail/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 107, in main
    cli.main(args)
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 124, in main
    jmp[args.module].main(args, pass_through_args))
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 88, in main
    gcloud.run(cmd)
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
    return subprocess.check_call(["gcloud"] + command)
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)

Script:

Command:
hailctl dataproc submit kml subpop_analysis.py
--run-filter-subpop-qc
--run-subpop-pca
--high-quality
--pop mid
--overwrite
--max-proportion-mislabeled-training-samples .90

Cluster:
hailctl dataproc start kml --worker-machine-type n1-highmem-16 --num-workers 30 --init gs://gnomad-kristen/mitochondria/broad_master-init.sh --requester-pays-allow-buckets gnomad --project broad-mpg-gnomad --max-idle=30m

Can you point me to where ld_prune is being called? When I search the script you linked I don’t see it.

It gets called within the get_qc_mt function:
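For reference, here is roughly where that call sits (a sketch only: the real get_qc_mt in the gnomad library takes more arguments, and the ld_r2 default and the prior filtering shown here are assumptions):

```python
import hail as hl

# Rough sketch only: the actual get_qc_mt implementation and its parameters
# (ld_r2, prior filters) differ; this just shows where hl.ld_prune is invoked.
def get_qc_mt(mt: hl.MatrixTable, ld_r2: float = 0.1) -> hl.MatrixTable:
    # ... allele-frequency / call-rate filters happen before this point ...
    pruned_ht = hl.ld_prune(mt.GT, r2=ld_r2)  # the call that emits the warning
    return mt.filter_rows(hl.is_defined(pruned_ht[mt.row_key]))
```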

Something I might try here is, before the call to ld_prune, writing the matrix table to disk and reading it back with many fewer partitions (since this step comes after a frequency filter that removes most variants). That lets the fast local per-partition prune do much more of the work, so you end up with a smaller graph to prune in memory at the end.
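Something like this, as a sketch (the checkpoint path, partition count, and r2 value below are placeholders, not what your script actually uses):

```python
import hail as hl

# Sketch of the suggestion: checkpoint the post-frequency-filter MT, shrink the
# number of partitions without a shuffle, then prune. Fewer, larger partitions
# let the local per-partition prune remove most variants, leaving a smaller
# global graph for maximal_independent_set at the end.
mt = mt.checkpoint('gs://my-bucket/tmp/pre_ld_prune.mt', overwrite=True)  # placeholder path
mt = mt.naive_coalesce(250)  # placeholder target; far fewer than the ~115k partitions in the log
pruned_ht = hl.ld_prune(mt.GT, r2=0.1)  # placeholder r2
mt = mt.filter_rows(hl.is_defined(pruned_ht[mt.row_key]))
```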

This step runs properly after incorporating your suggestions, thank you!

Great! We’re working on an overhaul of ld_prune generally that will handle this automatically.