Estimating Large-Scale Storage Requirements

Hello,

I am in the process of running our Hail-based sample quality control script on roughly 470,000 exomes on the UK Biobank Research Analysis Platform. While running a test, my job failed due to insufficient storage space.

I came across this thread, which suggests that the requirements for a full-scale job can often be estimated linearly from a lower-scale test run. Do you think this approach would work for estimating the full-scale storage requirements for processing 470,000 exomes?
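
For concreteness, this is the kind of back-of-envelope scaling I have in mind; it is only a sketch, and the test sample count, observed usage, and safety factor below are placeholder values, not measurements from our run:

```python
# Back-of-envelope linear scaling from a test run to the full cohort.
# All numbers here are placeholders, not measurements from our actual run.

def estimate_full_scale_storage_gb(
    test_samples: int,
    full_samples: int,
    peak_test_usage_gb_per_node: float,
    safety_factor: float = 1.5,
) -> float:
    """Scale observed per-node peak storage linearly by cohort size,
    then pad with a safety factor to absorb shuffle/temp-file spikes."""
    scale = full_samples / test_samples
    return peak_test_usage_gb_per_node * scale * safety_factor

# Example: a hypothetical 50,000-sample test that peaked at ~40 GB per node
print(estimate_full_scale_storage_gb(50_000, 470_000, 40.0))  # ~564 GB per node
```

I am not sure how well a purely linear scaling holds once shuffles and temporary files come into play, which is partly why I am asking.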

For context, these were the parameters used for this job:
98 nodes, each with 64 GB of memory and roughly 300 GB of storage. You can see in the screenshot below that storage usage reached the limit just before the job failed.

Unfortunately, the temporary JupyterLab environment closed shortly after the job failed, and I was unable to retrieve the Hail log file. Do you have any suggestions for setting a safe upper limit on storage that would avoid errors like this in the future?

We estimate the full job will take roughly 2-3 days to run. Storage usage was roughly 130 GB per node over the first 8 hours of the run, but in the ninth hour it exceeded the allocated 300 GB per node.
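
To pick a safer per-node limit, I was considering something like the sketch below: fit the observed per-node usage over time, project it out to the expected 2-3 day runtime, and then add headroom. The (hour, GB) readings are illustrative stand-ins for whatever the monitoring panel reports, and a linear fit will understate any shuffle-driven spike like the one we saw in the ninth hour:

```python
# Project per-node disk usage out to the planned runtime and add headroom.
# The readings below are illustrative, not exported from our monitoring panel.
import numpy as np

hours = np.array([1, 2, 4, 6, 8])            # elapsed time (h), hypothetical
usage_gb = np.array([25, 45, 80, 110, 130])  # per-node disk usage (GB), hypothetical

# Fit a straight line (GB per hour) and extrapolate to a 72-hour run.
slope, intercept = np.polyfit(hours, usage_gb, 1)
projected_gb = slope * 72 + intercept

# Pad generously, since shuffle/temp files can spike well above the trend.
headroom = 2.0
print(f"growth ~{slope:.1f} GB/h, projected {projected_gb:.0f} GB, "
      f"suggested allocation {projected_gb * headroom:.0f} GB per node")
```

Is this a sensible way to set the limit, or is there a more reliable rule of thumb for Hail workloads at this scale?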

For reference, I am also sharing our updated Sample QC script.
hail_sample_qc_YL_MG_Step1_082823.txt (10.2 KB)