Hardware requirements

I would like to ask for your guidance and any tips regarding the following questions:

  1. What reasonable hardware requirements should I meet if I’d like to set up an “on-premise” server to run Hail 0.2 on a WES VCF with on the order of 2k samples?
  2. What is the most likely bottleneck for a GWAS-like analysis (QC, LD pruning, logistic regression)?
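For reference, the kind of pipeline being asked about might look roughly like this in Hail 0.2 (a sketch only: the paths, thresholds, and phenotype field names are hypothetical, not a tuned analysis):

```python
try:
    import hail as hl  # requires a working Hail 0.2 installation
except ImportError:
    hl = None  # sketch only; Hail not installed


def run_gwas(vcf_path, pheno_path):
    """Hypothetical QC -> LD-prune -> logistic regression pipeline."""
    hl.init()
    mt = hl.import_vcf(vcf_path, force_bgz=True)

    # Basic QC: drop low-call-rate samples, then rare/low-call-rate variants.
    mt = hl.sample_qc(mt)
    mt = mt.filter_cols(mt.sample_qc.call_rate > 0.97)
    mt = hl.variant_qc(mt)
    mt = mt.filter_rows((mt.variant_qc.call_rate > 0.99) &
                        (mt.variant_qc.AF[1] > 0.01))

    # LD pruning returns the set of variants to keep.
    pruned = hl.ld_prune(mt.GT, r2=0.2)
    mt = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))

    # Join phenotypes (hypothetical TSV keyed by sample id).
    pheno = hl.import_table(pheno_path, key='sample', impute=True)
    mt = mt.annotate_cols(pheno=pheno[mt.s])

    # Wald-test logistic regression; covariates must include the intercept.
    return hl.logistic_regression_rows(
        test='wald',
        y=mt.pheno.is_case,
        x=mt.GT.n_alt_alleles(),
        covariates=[1.0])
```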

I don’t have much in-depth understanding of Spark internals, so any help would be greatly appreciated.

This is going to be about ~20G vcf.gz, right?

I’ve used similar-size data for experiments on a MacBook Pro a couple of years ago (8 vcores, 16G RAM). Anything bigger than that will only improve the experience: more cores will make things go faster.

An SSD will help a great deal as well, since Spark stores temporary data on disk.
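On a single machine that might look something like this (the machine size, core count, and SSD path are hypothetical; note that `PYSPARK_SUBMIT_ARGS` must be set before Hail starts Spark):

```python
import os

# Give the Spark driver most of the box's RAM (hypothetical 32 GB machine).
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 24g pyspark-shell"

try:
    import hail as hl
except ImportError:
    hl = None  # sketch only; Hail not installed


def init_local(n_cores=8, tmp_dir="/mnt/ssd/hail-tmp"):
    # local[n]: use n cores; tmp_dir: put Hail's scratch files on the SSD.
    hl.init(local=f"local[{n_cores}]", tmp_dir=tmp_dir)
```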


Roughly… I was thinking less than 300k variants, and hoping that it would be less than 10G vcf.bgz actually. Does that sound right?
So I shouldn’t be able to run out of RAM already at 16G? I was more pessimistic. Thanks for the good news! :slight_smile:

No, definitely not! Hail’s memory requirements don’t scale with the on-disk size of the dataset; instead, it streams through the data a few rows at a time.
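A toy analogy in plain Python (not Hail’s actual implementation): aggregating over rows produced one at a time means peak memory is one row, not the whole matrix.

```python
def genotype_rows(n_variants, n_samples):
    # Toy generator standing in for rows read off disk one at a time.
    for i in range(n_variants):
        yield [(i + j) % 3 for j in range(n_samples)]  # fake GT calls


def non_ref_rate(rows):
    # Aggregate across all rows while holding only one row in memory.
    total = non_ref = 0
    for row in rows:
        total += len(row)
        non_ref += sum(1 for gt in row if gt > 0)
    return non_ref / total


rate = non_ref_rate(genotype_rows(1000, 20))
print(rate)
```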

still, speaking as a person who doesn’t always know what he is doing, it is possible to run out of memory doing e.g. hwe_normalized_pca(k=a_lot) or [mt.group_rows_by(mt.locus, mt.alleles).aggregate(GT=hl.agg.collect(mt.GT).head()).count() for _ in range(a_lot)] :slight_smile:

Certainly there are some operations that require more memory than one row. When results are sent from Hail’s backend to Python, those are entirely in memory. This means that things like ht.collect() can easily cause out-of-memory exceptions.
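One way to stay safe (a sketch; `ht` stands for any Hail Table): pull only a handful of rows into Python, or write results to disk distributed instead of collecting them.

```python
try:
    import hail as hl
except ImportError:
    hl = None  # sketch only; Hail not installed


def peek(ht, n=10):
    # take(n) materializes only n rows in Python, unlike ht.collect(),
    # which pulls the entire table into driver memory.
    return ht.take(n)


def save(ht, path):
    # export() writes the table out without collecting it in Python.
    ht.export(path)
```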

Right now the column values (sample ids + annotations) are also stored in memory, so adding a lot of data there will increase memory usage. Users have encountered this in datasets like the UK Biobank (many samples * many phenotypes = lots of column data).

Your two examples here should require minimal memory. Note, though, that our current PCA implementation uses a library in Spark that isn’t designed to handle huge K and so Hail’s PCA shouldn’t be used for a full eigendecomposition.
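So keeping k small is the intended use (a sketch; `mt` stands for a QC’d MatrixTable):

```python
try:
    import hail as hl
except ImportError:
    hl = None  # sketch only; Hail not installed


def pca_scores(mt, k=10):
    # k on the order of 10 is fine; a very large k is where the underlying
    # Spark implementation runs into trouble.
    eigenvalues, scores, _loadings = hl.hwe_normalized_pca(mt.GT, k=k)
    return eigenvalues, scores
```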