How to keep the kernel alive?

Hi,

I’m working with whole-genome sequencing (WGS) data to compute polygenic scores (PGS) for a few phenotypes, but the kernel died partway through the job. I’ve benchmarked a few parameters on a small subset, but those settings don’t seem to scale or reproduce on the full dataset. Any suggestions on which of the options below might prevent the crash would be greatly appreciated.

  1. When initializing Hail, should I run the first or the second command below? I have 16 CPUs and 60 GB RAM in my cloud compute profile, but only 4 CPUs and 15 GB RAM per worker in my All of Us environment.
hl.init(default_reference="GRCh38", spark_conf={'spark.driver.memory': '60g'})
hl.init(default_reference="GRCh38")
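For what it’s worth, this is how I’ve been checking which settings actually take effect (a quick sketch; my understanding is that hl.spark_context() exposes the underlying SparkContext):

import hail as hl

hl.init(default_reference="GRCh38")
conf = hl.spark_context().getConf()
print(conf.get('spark.driver.memory', 'not set'))  # what Spark actually applied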
  2. I filtered the WGS matrix table down to QC’ed variants and kept only the fields needed downstream. Is the order of the following three commands optimal?
mt = mt.semi_join_rows(var_wgs)                   # keep only QC'ed variants
mt = mt.select_entries(GT=mt.GT.n_alt_alleles())  # store dosage as alt-allele count
mt = mt.select_rows(rsid=mt.rsid)                 # drop all other row fields
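For comparison, I also considered the filter_rows form, which I believe does the same thing as the semi-join:

mt = mt.filter_rows(hl.is_defined(var_wgs[mt.row_key]))  # assumes var_wgs is keyed like mt's rows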
  3. Next, I filtered the samples. Is there anything I can simplify in this step?
sample = hl.import_table("XXXX",
                         missing='',
                         impute=True,
                         types={"person_id": "str"})
sample = sample.key_by("person_id")
mt = mt.semi_join_cols(sample)
mt = mt.annotate_cols(**sample[mt.s])
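One variant I’m considering, on the assumption that filter_cols with is_defined is equivalent to the semi-join (please correct me if not):

mt = mt.filter_cols(hl.is_defined(sample[mt.s]))  # keep only samples present in `sample`
mt = mt.annotate_cols(**sample[mt.s])             # then attach the sample annotations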
  4. This step, which uses hl.agg.sum, seems more computationally intensive than the others. Any suggestions for optimization?
mt = mt.annotate_cols(pgs=hl.struct(pgs1=hl.agg.sum(mt.sumstats.beta_thresh1 * mt.GT), ...))  # additional pgs fields elided
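For reference, the sample_info table used in the last step is derived from the column table after this aggregation (mt.cols() gives one row per sample), roughly like this:

sample_info = mt.cols()  # per-sample table carrying the pgs struct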
  5. I’m having difficulty optimizing the last few pieces of code. For WGS-scale data, how many partitions should I use, and should I prefer mt.repartition or mt.naive_coalesce? (I used 2 workers and 100 preemptible workers, if that helps with the calculation.)
# Benchmarking commands:
mt = mt.repartition(2000)                 # full shuffle
mt = mt.repartition(2000, shuffle=False)  # no shuffle; my understanding is this can only decrease partitions
mt = mt.naive_coalesce(2000)              # merges adjacent partitions without a shuffle
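For context, I check the starting partition count before each benchmark (as far as I know, n_partitions() just reads metadata and doesn’t trigger a job):

print(mt.n_partitions())  # partition count before any repartitioning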
  6. I save the Hail table as a checkpoint for now, but I’ll eventually export it as tsv.bgz. Can anything here be adjusted to save memory?
sample_info = sample_info.checkpoint(export_filename, overwrite=True)  # write to disk, then read back
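The eventual export would look something like this (placeholder path; my understanding is that a .bgz extension makes export write block-gzipped output):

sample_info.export("XXXX.tsv.bgz")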