We do not yet have detailed guidelines, largely because this is a moving target: the infrastructure is still rapidly improving, and the trade-off between cost and scale is pipeline- and data-dependent. For example, we have just begun adding query optimizations to the compiler introduced in 0.2 (definitely start with this version). Long term, we are very interested in providing more concrete guidelines and, beyond that, in handling most or all aspects of cluster-configuration optimization automatically.
Before running a huge job, we recommend experimenting on a subset of your data to ensure your script works as intended and to get a sense of the core hours the full job will need. If you're worried about efficiency, see what effect doubling the cluster size has on runtime. Using preemptible nodes can also save money.
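As a rough illustration, you can extrapolate the full job's core-hours from a pilot run on a subset. This is a back-of-envelope sketch, not an exact model (real pipelines rarely scale perfectly linearly), and all the numbers below are made-up examples:

```python
def estimate_core_hours(pilot_hours, pilot_cores, pilot_fraction):
    """Linearly scale a pilot run's core-hours to the full dataset.

    pilot_hours    -- wall-clock hours the pilot run took
    pilot_cores    -- number of cores used for the pilot
    pilot_fraction -- fraction of the full data the pilot processed (0-1)
    """
    pilot_core_hours = pilot_hours * pilot_cores
    return pilot_core_hours / pilot_fraction

# Hypothetical example: a pilot on 5% of the data took 0.5 hours on 16 cores.
full = estimate_core_hours(pilot_hours=0.5, pilot_cores=16, pilot_fraction=0.05)
print(f"estimated full-job core-hours: {full:.0f}")  # 160
```

The same arithmetic works for the doubling experiment: if doubling the cluster cuts runtime well short of half, you are paying for cores the pipeline cannot use efficiently.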
In the end, whether to scale well beyond peak efficiency depends on your resources and how much you value your own time.