I would like to have an estimate of the computing cost of QC and association analyses of ~5K samples (WES and WGS) using Hail on Google Cloud. Is there an easy way to estimate the compute cost and storage cost for such a dataset? Or would there be an easy way to estimate based on ExAC/gnomAD experience? Thanks!
Storage cost is easy to estimate – $25 per terabyte per month for Google buckets. You’ll probably be storing one copy of the original VCFs and one copy of Hail native files, which are slightly smaller.
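As a quick illustration, here is a back-of-envelope sketch of that storage estimate in Python. The $25/TB/month figure comes from the answer above; the VCF size and the assumption that the Hail native files are ~90% of the VCF size are hypothetical placeholders you'd swap for your own numbers.

```python
# Back-of-envelope GCS storage cost: one copy of the original VCFs plus
# one copy of the (slightly smaller) Hail native files.
GCS_PRICE_PER_TB_MONTH = 25.0  # figure quoted above

def monthly_storage_cost(vcf_tb, hail_fraction=0.9,
                         price=GCS_PRICE_PER_TB_MONTH):
    """Monthly cost of storing the VCFs plus a Hail-native copy.

    hail_fraction is an assumed ratio of Hail file size to VCF size.
    """
    total_tb = vcf_tb * (1 + hail_fraction)
    return total_tb * price

# e.g. if your ~5K samples come to 2 TB of compressed VCF (assumed size):
print(round(monthly_storage_cost(2.0), 2))  # → 95.0
```

So roughly $50/month per TB of VCF once you count both copies, scaling linearly with dataset size.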
Compute cost is harder to estimate, because it’ll drastically depend on what you’re doing. Do you have a rough plan for what you want to do for QC and association?
I don’t have a detailed description of the QC pipeline, but we would like to apply something similar to what has been done with ExAC/gnomAD.
Thanks for the quick answer!
If you’re applying cuts on sample/variant/genotype fields, that’s computationally very cheap. Where it gets expensive is the iterative analysis you do to understand how those cuts affect data quality. This is how the gnomAD team does QC: they need to understand the data in order to choose the cuts. It’s hard to estimate a cost for that, but it’ll probably be on the order of tens of dollars per iteration.
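To make the cheap case concrete, here is a toy sketch (plain Python, not the actual Hail API) of hard-filter cuts on per-sample metrics; in Hail you would compute these with `hl.sample_qc` and apply them with `mt.filter_cols`. The thresholds and sample values below are made-up placeholders, not gnomAD's actual cutoffs.

```python
# Toy hard-filtering sketch: drop samples failing fixed threshold cuts.
# In Hail, hl.sample_qc(mt) would annotate these metrics per column.
samples = [
    {"id": "S1", "call_rate": 0.99, "mean_dp": 31.0},
    {"id": "S2", "call_rate": 0.92, "mean_dp": 28.5},  # fails call-rate cut
    {"id": "S3", "call_rate": 0.98, "mean_dp": 12.0},  # fails depth cut
]

# Assumed thresholds, in the spirit of (but not identical to) gnomAD's:
CALL_RATE_MIN = 0.97
MEAN_DP_MIN = 20.0

passing = [s["id"] for s in samples
           if s["call_rate"] >= CALL_RATE_MIN
           and s["mean_dp"] >= MEAN_DP_MIN]
print(passing)  # → ['S1']
```

A single pass like this over the data is cheap; the cost comes from re-running analyses with different thresholds to see which cuts you actually want.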