Maximal_independent_set may run out of memory warning

Hi, I’m running a script that dies after giving the warning “maximal_independent_set may run out of memory” - do you have any advice on this?

2022-08-02 16:21:36 Hail: INFO: ld_prune: running local pruning stage with max queue size of 541201 variants
2022-08-02 16:24:01 Hail: INFO: wrote table with 3356459 rows in 115376 partitions to /tmp/M9ZpN5CoeFopVRDnpKVv5V
    Total size: 95.67 MiB
    * Rows: 95.67 MiB
    * Globals: 11.00 B
    * Smallest partition: 0 rows (21.00 B)
    * Largest partition:  77 rows (2.33 KiB)
2022-08-02 16:28:36 Hail: INFO: Wrote all 820 blocks of 3356459 x 294 matrix with block size 4096.
2022-08-02 16:35:52 Hail: INFO: wrote table with 48565726 rows in 1639 partitions to /tmp/tfHThUiCD9m00Bz2BKBVyt
    Total size: 533.95 MiB
    * Rows: 500.28 MiB
    * Globals: 33.66 MiB
    * Smallest partition: 419 rows (3.91 KiB)
    * Largest partition:  78125 rows (840.47 KiB)
2022-08-02 16:39:12 Hail: WARN: over 400,000 edges are in the graph; maximal_independent_set may run out of memory
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [db790e4009254d9893a4dc042d62d717] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/db790e4009254d9893a4dc042d62d717?project=broad-mpg-gnomad&region=us-central1
gcloud dataproc jobs wait 'db790e4009254d9893a4dc042d62d717' --region 'us-central1' --project 'broad-mpg-gnomad'
https://console.cloud.google.com/storage/browser/dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/c9402aa7-e519-41e2-b3e0-7c91dc3d72dc/jobs/db790e4009254d9893a4dc042d62d717/
gs://dataproc-faa46220-ec08-4f5b-92bd-9722e1963047-us-central1/google-cloud-dataproc-metainfo/c9402aa7-e519-41e2-b3e0-7c91dc3d72dc/jobs/db790e4009254d9893a4dc042d62d717/driveroutput
Traceback (most recent call last):
  File "/Users/kristen/miniconda3/envs/hail/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 107, in main
    cli.main(args)
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 124, in main
    jmp[args.module].main(args, pass_through_args))
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 88, in main
    gcloud.run(cmd)
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
    return subprocess.check_call(["gcloud"] + command)
  File "/Users/kristen/miniconda3/envs/hail/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)

Script:

Command:
hailctl dataproc submit kml subpop_analysis.py
--run-filter-subpop-qc
--run-subpop-pca
--high-quality
--pop mid
--overwrite
--max-proportion-mislabeled-training-samples .90

Cluster:
hailctl dataproc start kml --worker-machine-type n1-highmem-16 --num-workers 30 --init gs://gnomad-kristen/mitochondria/broad_master-init.sh --requester-pays-allow-buckets gnomad --project broad-mpg-gnomad --max-idle=30m

Can you point me to where ld_prune is being called? When I search the script you linked I don’t see it.

It gets called within the get_qc_mt function:
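For reference, here is roughly where that call sits (a sketch only: the real get_qc_mt in the gnomad library takes more arguments, and the ld_r2 default and the prior filtering shown here are assumptions):

```python
import hail as hl

# Rough sketch only: the actual get_qc_mt implementation and its parameters
# (ld_r2, prior filters) differ; this just shows where hl.ld_prune is invoked.
def get_qc_mt(mt: hl.MatrixTable, ld_r2: float = 0.1) -> hl.MatrixTable:
    # ... allele-frequency / call-rate filters happen before this point ...
    pruned_ht = hl.ld_prune(mt.GT, r2=ld_r2)  # the call that emits the warning
    return mt.filter_rows(hl.is_defined(pruned_ht[mt.row_key]))
```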

Something I might try here is, before the call to ld_prune, writing the matrix table to disk and reading it back with many fewer partitions (since this step comes after a frequency filter that removes most variants). That lets the fast local per-partition prune do much more of the work, so you end up with a smaller graph to prune in memory at the end.
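Something like this, as a sketch (the checkpoint path, partition count, and r2 value below are placeholders, not what your script actually uses):

```python
import hail as hl

# Sketch of the suggestion: checkpoint the post-frequency-filter MT, shrink the
# number of partitions without a shuffle, then prune. Fewer, larger partitions
# let the local per-partition prune remove most variants, leaving a smaller
# global graph for maximal_independent_set at the end.
mt = mt.checkpoint('gs://my-bucket/tmp/pre_ld_prune.mt', overwrite=True)  # placeholder path
mt = mt.naive_coalesce(250)  # placeholder target; far fewer than the ~115k partitions in the log
pruned_ht = hl.ld_prune(mt.GT, r2=0.1)  # placeholder r2
mt = mt.filter_rows(hl.is_defined(pruned_ht[mt.row_key]))
```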

This step runs properly after incorporating your suggestions, thank you!

Great! We’re working on an overhaul of ld_prune generally that will handle this automatically.