What Google cluster parameters work for moderate-scale WGS?

I have been playing with cloudtools and have provisioned a test Google Hail cluster with:

cluster start testcluster -p 6

  • This seems adequate for reading/writing the test 1KG data supplied with the Hail package. At least, my tests are working. Yay!

I have two or three WGS VCFs I want to read in and process, in increasing order of difficulty:

  1. A little test set of ~500 variants x 20,000 samples

  2. A moderate test set of one chromosome’s worth of the above data, so I’m guessing ~10M variants x 20,000 samples

  3. The whole genome’s worth of data (~180M unfiltered variants x 20k samples)

In all three cases, the basic initial processing involves (a rough code sketch follows this list):

  • Annotating on a table of VQSR scores

  • Annotating sample info

  • Adding Hail-standard variant / sample QC annotations

  • Applying hard filters based on VQSLOD and various other site / sample metrics

  • Running PCA to check the effectiveness of the hard filters (the larger the dataset, the better, I guess)

  • Running a basic GWAS.
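For concreteness, here is roughly what I mean, as a minimal sketch in Hail 0.2 syntax (adjust if you’re on 0.1). The bucket paths, the 'Sample' key, the phenotype field, and all thresholds are placeholders; the VQSLOD field is assumed to arrive via the INFO of a VQSR sites VCF:

import hail as hl

# Import the WGS VCF; min_partitions is one place to control partitioning.
# (Paths and field names are placeholders.)
mt = hl.import_vcf('gs://my-bucket/wgs/chr20.vcf.bgz', min_partitions=5000)

# Annotate variants with VQSR scores from a sites-only VCF, then split multi-allelics.
vqsr = hl.import_vcf('gs://my-bucket/wgs/vqsr_sites.vcf.bgz').rows()
mt = mt.annotate_rows(vqsr=vqsr[mt.row_key].info)
mt = hl.split_multi_hts(mt)

# Annotate samples with phenotype / sample info from a table keyed by sample ID.
pheno = hl.import_table('gs://my-bucket/wgs/samples.tsv', impute=True, key='Sample')
mt = mt.annotate_cols(pheno=pheno[mt.s])

# Add Hail's standard variant and sample QC annotations.
mt = hl.variant_qc(mt)
mt = hl.sample_qc(mt)

# Hard filters on VQSLOD and other site / sample metrics (thresholds illustrative).
mt = mt.filter_rows((mt.vqsr.VQSLOD > 0) & (mt.variant_qc.call_rate > 0.95))
mt = mt.filter_cols(mt.sample_qc.call_rate > 0.97)

# PCA on common sites to check the effectiveness of the hard filters.
common = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)
eigenvalues, scores, _ = hl.hwe_normalized_pca(common.GT, k=10)

# Basic GWAS: linear regression with the first few PCs as covariates.
mt = mt.annotate_cols(pcs=scores[mt.s].scores)
gwas = hl.linear_regression_rows(
    y=mt.pheno.phenotype,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, mt.pcs[0], mt.pcs[1], mt.pcs[2]])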

Clearly a vanilla test cluster with two workers and four pre-emptible workers isn’t going to cut it. What should we be provisioning here to get effective response times for each step in the process for (say) a single chromosome? I realise it’s a vague question, but I’m happy to have a dialogue. If there are further task-specific parameters I should be setting to make the best use of lots of cores, is there anywhere that best practice is written down?

Note: I have seen the answer to “Cluster size for using Hail on AWS?”, which indicates a few thousand cores for a smaller problem, with a drop-off beyond ~5k cores. I am also puzzled about where to specify partitions per core, and what to set that parameter to (this has come up before).

Thanks

Vivek

It’s amazing how fast these data are growing – 18 months ago, datasets with 20K WGS samples were the largest anyone had tried to process, but now they’re “moderate scale”!

It’s hard to give concrete numbers here, but I think that ~2000 cores should work fine for the whole genome. Reducing that to ~500-1k cores for the single chromosome might be a bit more efficient.

The only piece of your processing that won’t scale trivially is PCA. For PCA, it’s generally advised to use a small number of common, LD-pruned sites, but LD pruning is quite slow right now, so you may not want to prune in Hail. PCA is also very sensitive to data partitioning, so I’d recommend repartitioning to decrease the number of partitions before running PCA – happy to suggest tweaks for the pipeline if you post it.
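If it helps, the shape I have in mind for that step is something like this (Hail 0.2 syntax; the thresholds and partition count are made up, and mt is a MatrixTable that already has the variant_qc annotations):

import hail as hl

# Restrict PCA to common sites; LD pruning is optional and, as noted, slow at scale.
common = mt.filter_rows(mt.variant_qc.AF[1] > 0.05)
pruned = hl.ld_prune(common.GT, r2=0.2)
common = common.filter_rows(hl.is_defined(pruned[common.row_key]))

# Collapse to far fewer partitions before PCA; naive_coalesce avoids a full shuffle.
common = common.naive_coalesce(200)

eigenvalues, scores, loadings = hl.hwe_normalized_pca(common.GT, k=10)

Whether 200 (or a few hundred) partitions is right depends on how many sites survive the filtering; the point is just to get the partition count down so PCA isn’t dominated by per-task overhead.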