How to keep the kernel alive?

Hi,

I’m working with whole-genome sequencing (WGS) data to compute polygenic scores (PGS) for a few phenotypes, but the kernel died partway through the job. I’ve benchmarked a few parameters on a small subset, but those settings don’t seem to scale or reproduce on the full dataset. Any suggestions on which of the options below might prevent the crash would be greatly appreciated.

  1. When initializing Hail, should I run the first or the second command below? I have 16 CPUs and 60 GB RAM in my cloud compute profile, but only 4 CPUs and 15 GB RAM per worker in my All of Us environment.
hl.init(default_reference="GRCh38", spark_conf={'spark.driver.memory': '60g'})
hl.init(default_reference="GRCh38")
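For what it’s worth, this is how I’ve been checking which settings actually take effect (a quick sketch; my understanding is that hl.spark_context() exposes the underlying SparkContext):

import hail as hl

hl.init(default_reference="GRCh38")
conf = hl.spark_context().getConf()
print(conf.get('spark.driver.memory', 'not set'))  # what Spark actually applied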
  2. I filtered the WGS matrix table down to QC’ed variants and kept only the fields needed downstream. Is the order of the following three commands optimal?
mt = mt.semi_join_rows(var_wgs)                   # keep only QC'ed variants
mt = mt.select_entries(GT=mt.GT.n_alt_alleles())  # store dosage as alt-allele count
mt = mt.select_rows(rsid=mt.rsid)                 # drop all other row fields
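For comparison, I also considered the filter_rows form, which I believe does the same thing as the semi-join:

mt = mt.filter_rows(hl.is_defined(var_wgs[mt.row_key]))  # assumes var_wgs is keyed like mt's rows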
  3. Next, I filtered the samples. Is there anything I can simplify in this step?
sample = hl.import_table("XXXX",
                         missing='',
                         impute=True,
                         types={"person_id": "str"})
sample = sample.key_by("person_id")
mt = mt.semi_join_cols(sample)
mt = mt.annotate_cols(**sample[mt.s])
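One variant I’m considering, on the assumption that filter_cols with is_defined is equivalent to the semi-join (please correct me if not):

mt = mt.filter_cols(hl.is_defined(sample[mt.s]))  # keep only samples present in `sample`
mt = mt.annotate_cols(**sample[mt.s])             # then attach the sample annotations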
  4. This step, which uses hl.agg.sum, seems more computationally intensive than the others. Any suggestions for optimization?
mt = mt.annotate_cols(pgs=hl.struct(pgs1=hl.agg.sum(mt.sumstats.beta_thresh1 * mt.GT), ...))  # additional pgs fields elided
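For reference, the sample_info table used in the last step is derived from the column table after this aggregation (mt.cols() gives one row per sample), roughly like this:

sample_info = mt.cols()  # per-sample table carrying the pgs struct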
  5. I’m having difficulty optimizing the last few pieces of code. For WGS-scale data, how many partitions should I use, and should I prefer mt.repartition or mt.naive_coalesce? (I used 2 workers and 100 preemptible workers, if that helps with the calculation.)
# Benchmarking commands:
mt = mt.repartition(2000)                 # full shuffle
mt = mt.repartition(2000, shuffle=False)  # no shuffle; my understanding is this can only decrease partitions
mt = mt.naive_coalesce(2000)              # merges adjacent partitions without a shuffle
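For context, I check the starting partition count before each benchmark (as far as I know, n_partitions() just reads metadata and doesn’t trigger a job):

print(mt.n_partitions())  # partition count before any repartitioning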
  6. I save the Hail table as a checkpoint for now, but I’ll eventually export it as tsv.bgz. Can anything here be adjusted to save memory?
sample_info = sample_info.checkpoint(export_filename, overwrite=True)  # write to disk, then read back
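The eventual export would look something like this (placeholder path; my understanding is that a .bgz extension makes export write block-gzipped output):

sample_info.export("XXXX.tsv.bgz")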