Do I understand from the Simons website that you’ll be working with a VCF with 8,975 whole genomes?
For most operations, runtime scales inversely with cluster size, so long as there are at least a few tasks (partitions) per core: by doubling the number of cores, you get results back nearly twice as fast. So it's more a question of the size you want than the size you need. Many users with data at your scale use 300-3,000 cores. We've seen efficiency drop considerably with more than that (say, 5,000 cores) in a single cluster.
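For instance, here's a rough sketch (assuming Hail 0.2 and a hypothetical bucket path) of checking how many partitions your VCF imports into, so you can compare that against the core count you're considering:

```python
import hail as hl

hl.init()

# Hypothetical input path; substitute your own VCF.
mt = hl.import_vcf('gs://my-bucket/simons_8975_genomes.vcf.bgz',
                   reference_genome='GRCh38')

n_partitions = mt.n_partitions()
n_cores = 300  # e.g. a 300-core cluster

# Aim for at least a few partitions per core so every core stays busy.
# If the ratio is too low, re-import with min_partitions= to split the work more finely.
print(f'{n_partitions} partitions / {n_cores} cores = '
      f'{n_partitions / n_cores:.1f} partitions per core')
```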
We do not yet have detailed guidelines, largely because it's a moving target: the infrastructure is still rapidly improving, and the trade-off between cost and scale is pipeline- and data-dependent. For example, we've just begun adding query optimizations to the compiler introduced in 0.2 (definitely start with this version). Long term, we are very interested in providing more concrete guidelines, and more than that, in handling most/all aspects of cluster configuration optimization automatically.
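As a quick sanity check that you're actually on 0.2, you can print the installed version string from Python:

```python
import hail as hl
print(hl.version())  # expect something starting with '0.2'
```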
Before running a huge job, we recommend experimenting on a subset of your data to check that your script does what you intend and to get a sense of the core-hours the full job will need. If you're worried about efficiency, see what effect doubling the cluster has on runtime. Using preemptible nodes can also save money.
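Something along these lines works for the test run (hypothetical paths, with sample_qc standing in for your real pipeline): restrict to one chromosome, time it, and extrapolate to the full dataset.

```python
import time
import hail as hl

hl.init()

mt = hl.import_vcf('gs://my-bucket/simons_8975_genomes.vcf.bgz',
                   reference_genome='GRCh38')

# Restrict to one chromosome (or use mt.sample_rows(0.01) for a random 1% of rows).
subset = hl.filter_intervals(
    mt, [hl.parse_locus_interval('chr20', reference_genome='GRCh38')])

start = time.time()
# Stand-in for your real pipeline.
subset = hl.sample_qc(subset)
subset.cols().write('gs://my-bucket/test_sample_qc.ht', overwrite=True)
elapsed_hours = (time.time() - start) / 3600

n_cores = 300          # cores in the test cluster
fraction_of_data = 0.02  # chr20 is roughly 2% of the genome; adjust for your subset
print(f'Estimated core-hours for the full dataset: '
      f'{elapsed_hours * n_cores / fraction_of_data:.1f}')
```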
In the end, whether to scale well beyond peak efficiency depends on your resources and how much you value your own time.