What makes Hail go fast locally?

Hi,

I want to make Hail go fast(er) locally when doing exome-seq analysis (cleaning, annotating, PCA, etc.).

Say, starting from a base of:
10 cores, 20 threads
128 GB RAM
and all data on SSD.

Where should I spend more money: more RAM, faster cores, more cores, or a faster SSD?

Best,
Michel

Hi Michel! Good to hear from you!

This is a good question. We don’t have a ton of experience running Hail/Spark on large single instances, but I think as long as you’ve got at least ~4-6 GB of memory per hyperthreaded core, you’ll be fine from a memory standpoint. A faster SSD won’t help much with a small core count, but disk bandwidth could become limiting if you have many tens of cores reading from disk all at once. I don’t have enough information to recommend spending on cores over disk or the other way around, but I think a decent SSD should be able to keep Hail saturated on at least 32, and probably 64, threads for most tasks.
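To make the memory rule of thumb concrete, here is a rough sketch of the arithmetic for your base machine and for larger thread counts. It assumes "hyperthreaded core" means one hardware thread, and the 16 GB of headroom left for the OS is a made-up number for illustration. The helper names are hypothetical, not part of Hail. The last step uses PySpark's `PYSPARK_SUBMIT_ARGS` environment variable, which is one common way to give a local Spark JVM more memory; it has to be set before Hail starts Spark.

```python
import os

# Rule of thumb from the reply above: ~4-6 GB of memory per hardware thread.
MIN_GB_PER_THREAD = 4
MAX_GB_PER_THREAD = 6

# Assumption for illustration only: memory left for the OS and other processes.
OS_HEADROOM_GB = 16


def memory_per_thread_gb(total_ram_gb, threads):
    """RAM available per hardware thread, in GB."""
    return total_ram_gb / threads


def ram_needed_gb(threads, gb_per_thread):
    """Total RAM needed to give every thread `gb_per_thread` GB."""
    return threads * gb_per_thread


def pyspark_submit_args(driver_memory_gb):
    """Build the PYSPARK_SUBMIT_ARGS value that sets the local driver memory."""
    return f"--driver-memory {driver_memory_gb}g pyspark-shell"


# The base machine from the question: 20 threads, 128 GB RAM.
base = memory_per_thread_gb(128, 20)
print(f"base machine: {base:.1f} GB per thread")

# What the same rule implies when scaling up the thread count.
for threads in (32, 64):
    low = ram_needed_gb(threads, MIN_GB_PER_THREAD)
    high = ram_needed_gb(threads, MAX_GB_PER_THREAD)
    print(f"{threads} threads: roughly {low}-{high} GB RAM")

# Must be set before the Spark JVM starts, i.e. before Hail is initialized.
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args(128 - OS_HEADROOM_GB)
```

By this estimate the base machine already sits just above the guideline, so extra RAM only starts to matter once the thread count grows.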

P.S. We may be interested in talking with you for advice on getting S.E.M. into Hail in the future!