Error summary: OutOfMemoryError: Java heap space

OK. A few things!

  1. When using Hail on a single, large server, you need to explicitly tell Apache Spark how much memory is available. See details here: How do I increase the memory or RAM available to the JVM when I start Hail through Python? - #2 by danking. In particular, you might try starting Jupyter this way:
PYSPARK_SUBMIT_ARGS="--driver-memory 460g --executor-memory 460g pyspark-shell" jupyter notebook
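If you prefer to configure this from Python rather than the shell environment, Hail's `hl.init` accepts a `spark_conf` dictionary. This is a sketch, not a drop-in replacement: it must run before any other Hail call (Hail starts the JVM at init time), and the 460g value is just an example sized for a large single server; leave some headroom for the OS rather than allocating all physical RAM.

```python
import hail as hl

# Must be the first Hail call in the session; the JVM picks up
# these settings only at startup.
hl.init(spark_conf={
    'spark.driver.memory': '460g',    # heap for the driver JVM (the one that ran out)
    'spark.executor.memory': '460g',  # executor heap (shared with the driver in local mode)
})
```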
  2. When running PCA, you definitely do not need 24M variants. Assuming that you are using PCA to interrogate the ancestry of your samples, common variants are sufficient. I suggest something like this:
import hail as hl

EUR_for_pca = EUR_mt_full
EUR_for_pca = hl.variant_qc(EUR_for_pca)
# filter to variants with minor allele frequency >5%
EUR_for_pca = EUR_for_pca.filter_rows(
    (EUR_for_pca.variant_qc.AF[0] > 0.05) & (EUR_for_pca.variant_qc.AF[0] < 0.95)
)
n_common_variants = EUR_for_pca.count_rows()
# keep a random ~10k subset of common variants
EUR_for_pca = EUR_for_pca.sample_rows(10_000 / n_common_variants)
# save the set of variants for later use
EUR_for_pca.rows().write('Haill_mt/variants_for_pca.ht')
EUR_pca_variants = hl.read_table('Haill_mt/variants_for_pca.ht')
# filter the matrix table to just the PCA variants
EUR_for_pca = EUR_mt_full.semi_join_rows(EUR_pca_variants)
EUR_eigenvalues, EUR_pcs, _ = hl.hwe_normalized_pca(EUR_for_pca.GT)
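Once the PCA has run, the scores table (`EUR_pcs`) is keyed by sample, so you can join it back onto the full matrix table's columns to use the PCs downstream, e.g. as covariates in association tests. A minimal sketch, assuming the default `hwe_normalized_pca` output (a `scores` array field per sample); the field name `pca_scores` is just an illustrative choice:

```python
# Join the per-sample PC scores onto the matrix table's columns.
# EUR_pcs is keyed by the column key of EUR_for_pca, so indexing by
# col_key performs the join.
EUR_mt_full = EUR_mt_full.annotate_cols(
    pca_scores=EUR_pcs[EUR_mt_full.col_key].scores
)
```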