Hi,
I’m working with WGS data to compute PGS for a few phenotypes, but the kernel died partway through the job. I’ve benchmarked a few parameter settings on a small subset, yet that approach doesn’t seem to scale or reproduce on the full dataset. Any suggestions on which of the options below might prevent the crash would be greatly appreciated.
- When initializing Hail, should I run the first or the second command below? My cloud compute profile has 16 CPUs and 60 GB of RAM, but the worker configuration in my All of Us environment has only 4 CPUs and 15 GB of RAM.
hl.init(default_reference = "GRCh38", spark_conf={'spark.driver.memory': '60g'})
hl.init(default_reference = "GRCh38")
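For reference, here is a quick check I plan to run right after whichever init I use, to see which memory settings actually took effect; this is a minimal sketch, assuming the underlying SparkContext is reachable through hl.spark_context() in this environment:
# Inspect the resolved Spark configuration to confirm whether the 60g request was honored
conf = hl.spark_context().getConf()
print(conf.get('spark.driver.memory', 'not set'))
print(conf.get('spark.executor.memory', 'not set'))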
- I filtered the WGS matrix table to the QC’ed variants and kept only the fields needed downstream. Is the order of the following three commands optimal? (The alternative ordering I’m weighing is sketched right after them.)
mt = mt.semi_join_rows(var_wgs)
mt = mt.select_entries(GT = mt.GT.n_alt_alleles())
mt = mt.select_rows(rsid = mt.rsid)
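For comparison, the alternative ordering trims the entry and row fields first and applies the row filter last; it uses the same tables and fields as above, and whether this changes Hail’s query plan is exactly what I’m unsure about:
# Alternative ordering: reduce entry/row fields first, filter rows to QC'ed variants last
mt_alt = mt.select_entries(GT=mt.GT.n_alt_alleles())
mt_alt = mt_alt.select_rows(rsid=mt_alt.rsid)
mt_alt = mt_alt.semi_join_rows(var_wgs)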
- Next, I filtered the samples. Is there anything I can simplify in this step?
sample = hl.import_table("XXXX",
                         missing='',
                         impute=True,
                         types={"person_id": "str"})
sample = sample.key_by("person_id")
mt = mt.semi_join_cols(sample)
mt = mt.annotate_cols(**sample[mt.s])
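In case it helps, the shorter variant I considered replaces the last two lines with a filter on the join expression; I’m not sure whether Hail treats it any differently from semi_join_cols:
# Keep only samples present in the imported table, then attach its fields
mt = mt.filter_cols(hl.is_defined(sample[mt.s]))
mt = mt.annotate_cols(**sample[mt.s])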
- This step, which uses hl.agg.sum, seems more computationally intensive. Any suggestions for optimization?
pgs = hl.struct(pgs1=hl.agg.sum(mt.sumstats.beta_thresh1 * mt.GT), ...)
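For context, this struct is built in a per-sample (column-annotation) aggregation, roughly as below; pgs2 and beta_thresh2 are hypothetical stand-ins for the other thresholds I elided with "...":
# Per-sample scores: sum beta * dosage over the QC'ed variants for each threshold
mt = mt.annotate_cols(
    pgs=hl.struct(
        pgs1=hl.agg.sum(mt.sumstats.beta_thresh1 * mt.GT),
        pgs2=hl.agg.sum(mt.sumstats.beta_thresh2 * mt.GT),  # hypothetical stand-in
    )
)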
- I’m having difficulty optimizing the last few pieces of code. For WGS-scale data, how many partitions should I use, and should I prefer mt.repartition or mt.naive_coalesce? (I used 2 workers and 100 preemptible workers, if that helps with the calculation.)
# Benchmarking commands:
mt = mt.repartition(2000)
mt = mt.repartition(2000, shuffle=False)
mt = mt.naive_coalesce(2000)
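For the partition count itself, the rule of thumb I’ve been working from is a small multiple of the total worker cores; the 4-cores-per-worker figure below is an assumption based on my profile above:
# 2 workers + 100 preemptible workers, assumed 4 cores each
total_cores = (2 + 100) * 4
target_partitions = 3 * total_cores  # Spark's usual guidance is ~2-4 tasks per core
print(mt.n_partitions(), target_partitions)  # compare current partitioning to the target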
- I’m saving the Hail table as a checkpoint for now, but I’ll eventually export it as a tsv.bgz (the export call I have in mind is sketched after the checkpoint line). Is there anything here that could be adjusted to save memory?
sample_info = sample_info.checkpoint(export_filename, overwrite=True)
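And for completeness, the export I eventually plan to run from the checkpointed table (the output path here is just a placeholder) is:
# The .bgz extension makes Hail write a block-gzipped TSV
sample_info.export('gs://.../pgs_results.tsv.bgz')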