Do I understand from the Simons website that you’ll be working with a VCF with 8,975 whole genomes?
For most operations, runtime scales inversely with cluster size, so long as there are at least a few tasks (partitions) per core: by doubling the number of cores, you get results back nearly twice as fast. So it's more a question of the size you want than the size you need. Many users with data at your scale use 300-3,000 cores. We've seen efficiency drop considerably with more than that (say, 5,000 cores) in a single cluster.
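For instance, here's a rough sketch (assuming Hail 0.2 and a hypothetical bucket path) of checking how many partitions your VCF imports into, so you can compare that against the core count you're considering:

```python
import hail as hl

hl.init()

# Hypothetical input path; substitute your own VCF.
mt = hl.import_vcf('gs://my-bucket/simons_8975_genomes.vcf.bgz',
                   reference_genome='GRCh38')

n_partitions = mt.n_partitions()
n_cores = 300  # e.g. a 300-core cluster

# Aim for at least a few partitions per core so every core stays busy.
# If the ratio is too low, re-import with min_partitions= to split the work more finely.
print(f'{n_partitions} partitions / {n_cores} cores = '
      f'{n_partitions / n_cores:.1f} partitions per core')
```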
We do not yet have detailed guidelines, largely because it's a moving target: the infrastructure is still rapidly improving, and the trade-off between cost and scale is pipeline- and data-dependent. For example, we've just begun adding query optimizations to the compiler introduced in 0.2 (definitely start with this version). Long term, we are very interested in providing more concrete guidelines, and more than that, in handling most/all aspects of cluster configuration optimization automatically.
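As a quick sanity check that you're actually on 0.2, you can print the installed version string from Python:

```python
import hail as hl
print(hl.version())  # expect something starting with '0.2'
```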
Before running a huge job, we recommend experimenting on a subset of your data to check that your script does what you intend and to get a sense of the core-hours the full job will need. If you're worried about efficiency, see what effect doubling the cluster has on runtime. Using preemptible nodes can also save money.
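Something along these lines works for the test run (hypothetical paths, with sample_qc standing in for your real pipeline): restrict to one chromosome, time it, and extrapolate to the full dataset.

```python
import time
import hail as hl

hl.init()

mt = hl.import_vcf('gs://my-bucket/simons_8975_genomes.vcf.bgz',
                   reference_genome='GRCh38')

# Restrict to one chromosome (or use mt.sample_rows(0.01) for a random 1% of rows).
subset = hl.filter_intervals(
    mt, [hl.parse_locus_interval('chr20', reference_genome='GRCh38')])

start = time.time()
# Stand-in for your real pipeline.
subset = hl.sample_qc(subset)
subset.cols().write('gs://my-bucket/test_sample_qc.ht', overwrite=True)
elapsed_hours = (time.time() - start) / 3600

n_cores = 300          # cores in the test cluster
fraction_of_data = 0.02  # chr20 is roughly 2% of the genome; adjust for your subset
print(f'Estimated core-hours for the full dataset: '
      f'{elapsed_hours * n_cores / fraction_of_data:.1f}')
```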
In the end, whether to scale well beyond peak efficiency depends on your resources and how much you value your own time.