Estimating Large-Scale Storage Requirements


I am in the process of running our Hail-based sample quality control script on roughly 470,000 exomes on the UK Biobank Research Analysis Platform. While running a test, my job failed due to insufficient storage space.

I saw this thread, which suggested that the requirements for a full-scale job can often be estimated linearly from a smaller-scale test. Do you think this approach would be helpful for estimating the full-scale storage requirements for processing 470,000 exomes?

For context, these were the parameters used for running this job:
98 nodes, each with 64 GB of memory and roughly 300 GB of storage. You can see in the screenshot below that storage usage reached the limit just before the job failed.

Unfortunately, the temporary JupyterLab environment closed shortly after the failure, so I was unable to retrieve the Hail log file. Do you have any suggestions for setting a safe upper storage limit that would avoid errors like this in the future?

We estimate the full job will run for roughly 2-3 days. Storage usage was roughly 130 GB per node over the first 8 hours, but in the ninth hour it exceeded the allocated 300 GB per node.
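To make the linear extrapolation concrete, here is a rough back-of-the-envelope sketch using the numbers above. The 60-hour runtime is my assumed midpoint of the 2-3 day estimate, and the 2x safety factor is an arbitrary cushion, not a recommendation from the thread; the jump from 130 GB to over 300 GB in the ninth hour suggests growth may not be purely linear, so any linear estimate should be treated as a lower bound.

```python
# Back-of-the-envelope per-node storage extrapolation.
# Observed values are from the test run described above; the runtime
# and safety factor are assumptions for illustration only.
hours_observed = 8        # window over which usage was tracked
gb_observed = 130         # per-node storage used in that window
runtime_hours = 60        # assumed midpoint of the 2-3 day estimate

rate_gb_per_hour = gb_observed / hours_observed      # 16.25 GB/h
linear_estimate = rate_gb_per_hour * runtime_hours   # 975 GB/node

# Pad the estimate, since the ninth-hour spike hints at nonlinear growth.
safety_factor = 2.0
suggested_limit = linear_estimate * safety_factor    # 1950 GB/node

print(f"Linear estimate:      {linear_estimate:.0f} GB per node")
print(f"Suggested allocation: {suggested_limit:.0f} GB per node")
```

If the ninth-hour spike reflects a shuffle or checkpoint stage rather than steady accumulation, the linear model would underestimate peak usage, which is why a generous safety factor (or a rerun of the test with per-hour monitoring past hour nine) seems worthwhile before committing to the full 470,000-exome run.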

For reference, I am also sharing our updated Sample QC script.
hail_sample_qc_YL_MG_Step1_082823.txt (10.2 KB)