Maximal_independent_set may run out of memory warning

Hi, I’m running a script that dies after giving the warning “maximal_independent_set may run out of memory” - do you have any advice on this?

2022-08-02 16:21:36 Hail: INFO: ld_prune: running local pruning stage with max queue size of 541201 variants
2022-08-02 16:24:01 Hail: INFO: wrote table with 3356459 rows in 115376 partitions to /tmp/M9ZpN5CoeFopVRDnpKVv5V
    Total size: 95.67 MiB
    * Rows: 95.67 MiB
    * Globals: 11.00 B
    * Smallest partition: 0 rows (21.00 B)
    * Largest partition:  77 rows (2.33 KiB)
2022-08-02 16:28:36 Hail: INFO: Wrote all 820 blocks of 3356459 x 294 matrix with block size 4096.
2022-08-02 16:35:52 Hail: INFO: wrote table with 48565726 rows in 1639 partitions to /tmp/tfHThUiCD9m00Bz2BKBVyt
    Total size: 533.95 MiB
    * Rows: 500.28 MiB
    * Globals: 33.66 MiB
    * Smallest partition: 419 rows (3.91 KiB)
    * Largest partition:  78125 rows (840.47 KiB)
2022-08-02 16:39:12 Hail: WARN: over 400,000 edges are in the graph; maximal_independent_set may run out of memory
ERROR: ( Job [db790e4009254d9893a4dc042d62d717] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
gcloud dataproc jobs wait 'db790e4009254d9893a4dc042d62d717' --region 'us-central1' --project 'broad-mpg-gnomad'
Traceback (most recent call last):
  File "/Users/kristen/miniconda3/envs/hail/bin/hailctl", line 8, in <module>
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/", line 107, in main
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/", line 124, in main
    jmp[args.module].main(args, pass_through_args))
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/asyncio/", line 587, in run_until_complete
    return future.result()
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/", line 88, in main
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/", line 9, in run
    return subprocess.check_call(["gcloud"] + command)
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/", line 363, in check_call
    raise CalledProcessError(retcode, cmd)


hailctl dataproc submit kml
--pop mid
--max-proportion-mislabeled-training-samples .90

hailctl dataproc start kml --worker-machine-type n1-highmem-16 --num-workers 30 --init gs://gnomad-kristen/mitochondria/ --requester-pays-allow-buckets gnomad --project broad-mpg-gnomad --max-idle=30m

Can you point me to where ld_prune is being called? When I search the script you linked I don’t see it.

It gets called within the get_qc_mt function.
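For reference, a rough sketch of what that call can look like (this is an illustration, not the actual gnomad_methods source; the function signature and r2 value here are assumptions):

import hail as hl

# Illustrative sketch only, not the real get_qc_mt implementation.
# After the frequency/call-rate filters, the QC MatrixTable is LD-pruned and
# restricted to the variants that survive pruning.
def get_qc_mt_sketch(mt: hl.MatrixTable, ld_r2: float = 0.1) -> hl.MatrixTable:
    # This is the step that emits the maximal_independent_set warning.
    pruned_ht = hl.ld_prune(mt.GT, r2=ld_r2)
    return mt.filter_rows(hl.is_defined(pruned_ht[mt.row_key]))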

Something I might try here is, before the call to ld_prune, writing to disk and reading back with many fewer partitions (since this is after a frequency filter that removes most variants). This will allow the fast local per-partition prune to work much better, and you’ll end up with a smaller graph to prune in memory at the end.
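A minimal sketch of that workaround (the path, partition count, and r2 are placeholders, and mt stands for the already frequency-filtered MatrixTable):

import hail as hl

# Write the filtered MatrixTable out and read it back so the coalesce below
# is cheap and works off the on-disk partitioning.
mt.write("gs://my-bucket/tmp/pre_ld_prune.mt", overwrite=True)
mt = hl.read_matrix_table("gs://my-bucket/tmp/pre_ld_prune.mt")

# Collapse to far fewer partitions so each one holds many variants: the local
# per-partition pruning stage can then remove most correlated variants up
# front, leaving a much smaller graph for the in-memory
# maximal_independent_set step at the end.
mt = mt.naive_coalesce(250)

pruned_ht = hl.ld_prune(mt.GT, r2=0.1)
mt = mt.filter_rows(hl.is_defined(pruned_ht[mt.row_key]))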

This step runs properly after incorporating your suggestions, thank you!

Great! We’re working on a general overhaul of ld_prune that will do this automatically.