Hardware requirements

Hi!
I would like to ask for your guidance and any tips on the following questions:

  1. What hardware requirements would be reasonable if I’d like to set up an “on-premises” server for running Hail 0.2 on a WES VCF with on the order of 2k samples?
  2. What would be the most likely bottleneck for a GWAS-like analysis (QC, LD pruning, logistic regression)? A rough sketch of the workflow I have in mind is below.

I don’t have much in-depth understanding of Spark’s internals, so any help would be greatly appreciated.
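
For concreteness, here is a minimal sketch of the pipeline from question 2, assuming the VCF is block-compressed and that phenotypes live in a separate table; all file paths, field names, and thresholds below are placeholders:

```python
import hail as hl

hl.init()

# Import the WES VCF (block-gzipped) as a MatrixTable and split multiallelics.
mt = hl.import_vcf('wes_2k_samples.vcf.bgz', reference_genome='GRCh38')
mt = hl.split_multi_hts(mt)

# Join phenotypes/covariates onto the columns (placeholder table and fields).
pheno = hl.import_table('phenotypes.tsv', impute=True, key='sample_id')
mt = mt.annotate_cols(pheno=pheno[mt.s])

# Basic QC: per-sample and per-variant metrics, then filtering.
mt = hl.sample_qc(mt)
mt = hl.variant_qc(mt)
mt = mt.filter_cols(mt.sample_qc.call_rate > 0.95)
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95)

# LD pruning: ld_prune returns the variants to keep.
pruned = hl.ld_prune(mt.GT, r2=0.2)
mt_pruned = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))

# Per-variant logistic regression (is_case is a 0/1 phenotype placeholder;
# the intercept 1.0 must be listed explicitly among the covariates).
results = hl.logistic_regression_rows(
    test='wald',
    y=mt.pheno.is_case,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, mt.pheno.age, mt.pheno.is_female])
```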

This is going to be about ~20G vcf.gz, right?

I’ve used similarly sized data for experiments on a MacBook Pro a couple of years ago (8 vcores, 16G RAM). Anything bigger than that will just offer a better experience: more cores will make things go faster.

An SSD will help a great deal as well, since Spark stores temporary data on disk.
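
For example, you can point Spark’s scratch space at the SSD and pick the number of local cores when initializing Hail; the paths and core count below are placeholders for your own machine:

```python
import hail as hl

# Run Spark locally on 8 cores and keep shuffle/temporary files on a fast SSD.
hl.init(
    master='local[8]',
    spark_conf={'spark.local.dir': '/mnt/ssd/spark-tmp'},
    tmp_dir='/mnt/ssd/hail-tmp')
```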

Roughly… I was thinking fewer than 300k variants, and hoping it would actually be less than 10G as a vcf.bgz. Does that sound right?
So I shouldn’t run out of RAM even at 16G? I was more pessimistic. Thanks for the good news! :slight_smile:

No, definitely not! Hail’s memory requirements don’t scale with the on-disk size of the dataset; instead, it streams through the data a few rows at a time.

Still, speaking as a person who doesn’t always know what he is doing, it is possible to run out of memory doing e.g. hwe_normalized_pca(k=a_lot) or [mt.group_rows_by(mt.locus, mt.alleles).aggregate(GT=hl.agg.collect(mt.GT).head()).count() for _ in range(a_lot)] :slight_smile:

Certainly there are some operations that require more memory than a single row. When results are sent from Hail’s backend to Python, they are held entirely in memory, so things like ht.collect() can easily cause out-of-memory exceptions.
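
For instance, aggregating on the backend returns only the final result to Python, whereas collecting a big table materializes every row locally; the table and field names here are just an illustration:

```python
import hail as hl

ht = hl.import_table('annotations.tsv.bgz', impute=True)

# Fine: only a single number comes back to Python.
n = ht.count()

# Fine: the aggregation runs on the backend; only the summary comes back.
stats = ht.aggregate(hl.agg.stats(ht.score))

# Risky on big tables: every row is pulled into local memory at once.
rows = ht.collect()

# Safer ways to inspect a large table: take a few rows, or write it out.
preview = ht.take(10)
ht.export('annotations_out.tsv.bgz')
```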

Right now the column values (sample IDs + annotations) are also stored in memory, so adding a lot of data there will increase memory usage. Users have encountered this with datasets like the UK Biobank (many samples × many phenotypes = lots of column data).
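
One way to keep column data small, for example, is to select only the sample fields the analysis actually needs before the heavy steps; the path and field names below are placeholders:

```python
import hail as hl

# Placeholder dataset; assumes phenotypes were annotated as a `pheno` struct.
mt = hl.read_matrix_table('dataset.mt')

# Keep just the fields used downstream instead of carrying every phenotype.
mt = mt.select_cols(
    is_case=mt.pheno.is_case,
    age=mt.pheno.age,
    is_female=mt.pheno.is_female)
```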

Your two examples here should require minimal memory. Note, though, that our current PCA implementation uses a Spark library that isn’t designed to handle large k, so Hail’s PCA shouldn’t be used for a full eigendecomposition.
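
In practice a small number of PCs is all that’s needed for ancestry covariates; a sketch, with a placeholder path and k=10 as a typical (not required) choice:

```python
import hail as hl

mt = hl.read_matrix_table('wes_2k_samples.mt')  # placeholder path

# A handful of PCs for population structure; keep k small, since the
# underlying Spark routine is not meant for a full eigendecomposition.
eigenvalues, scores, _ = hl.hwe_normalized_pca(mt.GT, k=10)

# Join the scores back onto the samples for use as regression covariates.
mt = mt.annotate_cols(pca_scores=scores[mt.s].scores)
```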