What are the disk and memory requirements for loading 200K UKBB VCFs into a MT?

I am trying to load all 1000 or so UKBB 200K Exome VCF files onto an AWS EMR cluster and would like to learn what the recommended memory and disk sizes should be for the various parts of the EMR cluster.

The 200K Exome data is about 7.5 TB of compressed VCF files, which I suspect will translate to a Hail MT of roughly equivalent size…

I am loading the VCF files from an S3 bucket and writing the MT to S3 as well, but what are the memory and disk-space requirements for the following parts of the Spark cluster?

Will all 7.5 TB of the Hail MT end up on the MASTER server, so that I should make sure it has 10 TB of attached disk? Or can I “spread it around” over the CORE and/or TASK servers?

Also, what are the required memory sizes for each of the servers? Is 64 GB enough?
I am thinking of the following:

  1. MASTER (Spark driver) - 2 instances - 64 GB memory - 250 GB disk space (?)
  2. CORE - 2 instances - 64 GB memory - 500 GB disk space
  3. TASK - between 2 and 30 instances - 64 GB memory - 500 GB disk space

If the data can be spread out over the TASK servers, there is enough space for 15 TB, which should fit the whole Hail MT. But is that how it works, or will the MASTER collect all the data for the MT and therefore need a 10 TB disk?

Thanks for the help in advance!

Thon

Sorry this got missed!

The answer is that there really isn’t any minimum cluster size for most pipelines. Hail doesn’t localize datasets in memory or on local disk; it streams through the data, typically reading from and writing to object stores like S3 / Google Storage / Azure Blob Storage / etc. The size of your dataset determines the runtime, but not the peak memory usage (though memory usage will scale with the number of samples, since we process a few rows of data together at a time).
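For context, here is a minimal sketch of that streaming pattern. The bucket names, glob pattern, reference genome, and `force_bgz` setting are assumptions for illustration, not details from the original post:

```python
import hail as hl

# Initialize Hail on the existing Spark/EMR cluster.
hl.init()

# Hypothetical S3 path -- replace with your own bucket and prefix.
vcf_glob = 's3://my-ukbb-bucket/exome_200k/*.vcf.gz'

# import_vcf builds a lazy query plan; the genotype data is not pulled
# into driver memory here. force_bgz treats the .vcf.gz files as
# block-gzipped so they can be read in parallel.
mt = hl.import_vcf(vcf_glob,
                   reference_genome='GRCh38',
                   force_bgz=True)

# The heavy work happens here: executors stream partitions of the VCFs
# from S3 and write the MatrixTable back to S3 in parallel. Nothing is
# collected onto the MASTER/driver node, so its disk can stay small.
mt.write('s3://my-ukbb-bucket/exome_200k.mt', overwrite=True)
```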

Great, so you are saying that even the hl.import_vcf step is lazy and doesn’t actually load the data until it is either used or written to disk with mt.write? So if I give it all 8.8 TB in the hl.import_vcf step, it will not load the data? I tried it, and it seems that hl.import_vcf is definitely trying to load the data as soon as I execute it in the notebook. So where is it storing the matrix table, then, if not in memory?

I think import_vcf is doing a query to look at whether the data is already sorted (a Hail requirement for efficient joins). It’s not actually loading the data in memory, though.
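A quick way to see this in a notebook, assuming the same hypothetical paths as above: building the MatrixTable kicks off only a small metadata job, while a full pass over the data happens only when an action such as count or write runs.

```python
import hail as hl

hl.init()

# Building the MatrixTable triggers only a small job (e.g. reading
# headers and checking ordering), not a full read of the 8.8 TB.
mt = hl.import_vcf('s3://my-ukbb-bucket/exome_200k/*.vcf.gz',
                   reference_genome='GRCh38',
                   force_bgz=True)

# describe() prints the row/column/entry schema without touching
# the genotype data.
mt.describe()

# An action like count() (or write(), as above) is what actually
# streams through all of the data on the executors.
n_variants, n_samples = mt.count()
print(n_variants, n_samples)
```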
