I am trying to load all 1000 or so UKBB 200K Exome VCF files onto an AWS EMR cluster and would like to learn what the recommended memory and disk sizes should be for the various parts of the EMR cluster.
The 200K Exome data is about 7.5 TB of compressed VCF files, which I suspect will translate into a Hail MatrixTable (MT) of roughly similar size.
I am loading the VCF files from an S3 bucket and writing the MT back to S3 as well, but what are the memory and disk requirements for the following parts of the Spark cluster?
Will the full 7.5 TB Hail MT end up on the MASTER server, so that I should make sure it has 10TB of attached disk? Or can I "spread it around" over the CORE and/or TASK servers?
Also, what memory does each of the servers need? Is 64GB enough?
I am thinking of the following:
- MASTER (Spark driver) - 2 instances - 64GB Mem - 250GB disk space (?)
- CORE - 2 instances - 64GB Mem - 500GB disk space
- TASK - between 2-30 instances - 64GB Mem - 500GB disk space
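For reference, on those 64GB nodes I was planning Spark memory settings roughly like the sketch below. The property names are standard Spark configuration keys; the values are my own guesses for this cluster, not recommendations from anywhere:

```
# Hypothetical spark-defaults for 64GB nodes -- values are my guesses, not vetted
spark.driver.memory            48g
spark.driver.maxResultSize     8g
spark.executor.memory          48g
spark.executor.memoryOverhead  8g
```

Happy to change any of these if the usual guidance for Hail on EMR says otherwise.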
If the data can be spread out over the TASK servers, there is enough space (up to 15TB) to fit the whole Hail MT. But is that how it works, or will the MASTER collect all the data for the MT, in which case it would need the 10TB disk?
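The 15TB figure above is just the TASK-node disks added up at the top of the range; a quick sanity check with my planned (purely illustrative) numbers:

```python
# Disk capacity across TASK nodes only, using the planned numbers above
max_task_nodes = 30      # upper end of the 2-30 task instance range
disk_per_task_gb = 500   # attached disk per task node

total_task_tb = max_task_nodes * disk_per_task_gb / 1000
print(total_task_tb)     # 15.0 TB, versus ~7.5 TB of compressed VCFs
```

So on paper there is about 2x headroom over the compressed input, assuming the MT really is spread across the workers rather than collected on the MASTER.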
Thanks for the help in advance!