I have been playing with cloudtools and have provisioned a test Google Hail cluster with:
cluster start testcluster -p 6
This seems adequate for reading and writing the test 1kg data supplied with the Hail package; at least, my tests are working. Yay!
I have two or three VCFs of WGS data I want to read in and process, in increasing order of difficulty:

- A small test set of ~500 variants x 20,000 samples
- A moderate test set of one chromosome's worth of the above data, so I'm guessing ~10M variants x 20,000 samples
- The whole genome's worth of data (~180M unfiltered variants x 20k samples)
In all three cases, the basic initial processing involves (a rough sketch in code follows this list):

- Annotating on a table of VQSR scores
- Annotating sample info
- Adding the Hail-standard variant / sample QC annotations
- Applying hard filters based on VQSLOD and various other site / sample metrics
- Running PCA to check the effectiveness of the hard filters (the larger the dataset, the better, I guess)
- Running a basic GWAS
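To make the question concrete, here is roughly what I have in mind, written against my (possibly shaky) understanding of Hail 0.2's Python API. The bucket paths, field names (VQSLOD, Sample, phenotype) and filter thresholds are all placeholders for illustration, not my real values:

```python
import hail as hl

hl.init()

# Import one chromosome's VCF; min_partitions sets the initial parallelism.
mt = hl.import_vcf('gs://my-bucket/chr20.vcf.bgz', min_partitions=5000)

# 1) Annotate with VQSR scores, here assumed to live in a sites-only VCF
#    so the keys (locus, alleles) line up with the genotype data.
vqsr = hl.import_vcf('gs://my-bucket/vqsr_sites.vcf.bgz').rows()
mt = mt.annotate_rows(vqslod=vqsr[mt.row_key].info.VQSLOD)

# 2) Annotate sample info from a keyed table (sample ID -> phenotype, covariates, ...).
samples = hl.import_table('gs://my-bucket/sample_info.tsv', impute=True, key='Sample')
mt = mt.annotate_cols(pheno=samples[mt.s])

# 3) Add the Hail-standard variant and sample QC annotations.
mt = hl.variant_qc(mt)
mt = hl.sample_qc(mt)

# 4) Hard filters on VQSLOD and other site / sample metrics (made-up thresholds).
mt = mt.filter_rows((mt.vqslod > 0) & (mt.variant_qc.call_rate > 0.97))
mt = mt.filter_cols(mt.sample_qc.call_rate > 0.95)

# 5) PCA on a common-variant subset to check the filters.
common = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)
eigenvalues, scores, _ = hl.hwe_normalized_pca(common.GT, k=10)
mt = mt.annotate_cols(pca=scores[mt.s])

# 6) A basic GWAS with the first few PCs as covariates.
gwas = hl.linear_regression_rows(
    y=mt.pheno.phenotype,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0] + [mt.pca.scores[i] for i in range(4)])
gwas.write('gs://my-bucket/gwas_results.ht', overwrite=True)
```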
Clearly a vanilla test cluster with two workers and four preemptible workers isn't going to cut it. What should we be provisioning to get effective response times for each step in the process for, say, a single chromosome? I realise it's a vague question, but I'm happy to have a dialogue. If there are further task-specific parameters I should be setting to make the best use of lots of cores, is there anywhere that best practice is written down?
Note: I have seen the answer "Cluster size for using Hail on AWS?" indicate a few thousand cores for a smaller problem, with a drop-off beyond 5k cores. I am also puzzled about where to specify partitions per core, and what to set that parameter to (this has come up before); my current understanding is sketched below.
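These are the places I understand partitioning can be set; the partitions-per-core figure below is just a guess on my part, which is exactly the kind of parameter I'd like best-practice guidance on:

```python
import hail as hl

# Guessing at a few partitions per core; this is the number I'd like guidance on.
n_cores = 1000
target_partitions = n_cores * 4

# At import time:
mt = hl.import_vcf('gs://my-bucket/chr20.vcf.bgz', min_partitions=target_partitions)
print(mt.n_partitions())

# On an existing MatrixTable, with a full shuffle:
mt = mt.repartition(target_partitions)

# Or reduce the partition count without a shuffle:
mt = mt.naive_coalesce(target_partitions // 2)
```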
Thanks
Vivek