I would like an estimate of the computing cost of QC and association analyses of ~5K samples (WES and WGS) using Hail on Google Cloud. Is there an easy way to estimate the compute cost and storage cost for such a dataset? Or would there be an easy way to estimate based on ExAC/gnomAD experience? Thanks!

Storage cost is easy to estimate -- $25 per terabyte per month for Google buckets. You'll probably be storing one copy of the original VCFs and one copy of the Hail native files, which are slightly smaller.
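At that rate, a rough monthly storage estimate is simple arithmetic. A minimal sketch, where the $25/TB/month figure comes from the reply above and the dataset sizes are hypothetical placeholders for your own:

```python
# Rough monthly Google Cloud Storage cost at $25 per terabyte per month.
# The per-TB rate is the figure quoted above; the dataset sizes below are
# hypothetical placeholders -- substitute your actual bucket usage.

PRICE_PER_TB_MONTH = 25.0  # USD per terabyte per month


def monthly_storage_cost(terabytes: float) -> float:
    """Monthly cost in USD for `terabytes` of data in a standard bucket."""
    return terabytes * PRICE_PER_TB_MONTH


vcf_tb = 2.0   # one copy of the original VCFs (hypothetical size)
hail_tb = 1.8  # Hail native files, slightly smaller than the VCFs

total = monthly_storage_cost(vcf_tb + hail_tb)
print(f"${total:.2f}/month")  # prints "$95.00/month"
```

Swapping in your real VCF and Hail table sizes gives the two-copies estimate directly.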

Compute cost is harder to estimate, because it'll drastically depend on what you're doing. Do you have a rough plan for what you want to do for QC and association?

I don't have a detailed description of the QC pipeline, but we would like to apply something similar to what has been done with ExAC/gnomAD.

Thanks for the quick answer!

Josep

If you're applying cuts on sample/variant/genotype fields, then that's computationally cheap. It gets expensive when you're doing iterative analysis to understand how those cuts affect analysis quality. This is how the gnomAD team does QC -- they need to understand the data in order to choose the cuts. It's hard to estimate a cost for this, but each iteration will probably cost on the order of tens of dollars.
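For a back-of-envelope feel for that per-iteration figure, you can multiply cluster size by hourly worker price by runtime. All numbers in this sketch are illustrative assumptions, not quoted prices -- check current Google Cloud pricing before budgeting:

```python
# Back-of-envelope cost for one QC iteration on a Dataproc-style cluster.
# The worker count, hourly price, and runtime below are hypothetical
# assumptions for illustration only, not actual Google Cloud quotes.

def iteration_cost(n_workers: int, price_per_worker_hour: float,
                   hours: float) -> float:
    """Compute cost in USD of one analysis iteration."""
    return n_workers * price_per_worker_hour * hours


# Hypothetical: 50 preemptible workers at ~$0.10/hour, running for 2 hours.
cost = iteration_cost(n_workers=50, price_per_worker_hour=0.10, hours=2)
print(f"~${cost:.0f} per iteration")
```

With these (assumed) numbers the result lands around $10, consistent with the "tens of dollars per iteration" estimate above; a bigger cluster or longer run scales the cost linearly.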

Thanks!