Hail sounds like a very pleasant way to work with large sequence data sets. However, I intend to work on large-N common-variant data sets (UKB).
So can I expect similar speed gains over existing software, as is the case for rare variant analysis?
Is there a tutorial on getting Hail to work on the LISA/SURFsara/Cartesius cluster? (I ask because I know a lot of PGC work is done on LISA, and someone may already have installed Hail there.)
In its current state, Hail is actually far more useful for common variant association than rare variant association. We currently have no formal RVAS methods implemented (though these will come soon), but for GWAS, we have linear regression, logistic regression, and linear mixed regression. Not only can Hail easily scale to large clusters to increase speed, but on a single core Hail is actually faster than PLINK for linear regression with covariates:
This plot was produced in October, so Hail is probably even faster now!
In response to your question about setting up Hail, the proper question is whether Spark is configured to run on LISA. I think the IT department managing the cluster would be the best place to look for answers.
Yeah, Spark is configured. I'll talk to them about installing Hail, thanks.
Is that PLINK 1.90 or 1.07 you benchmarked against though?
It was PLINK 1.9; we tested against the latest versions of PLINK and EPACTS.
Michel, a bit more detail:
Linear and logistic regression should scale well even for UKBB. However, linear mixed regression is currently optimized to handle, say, 10k whole genomes, with 100k variants (or more) in the kinship matrix and association tests on 20M (or more) variants. It won't run on UKBB due to limitations of the current implementation, as discussed in the docs (the kinship matrix is computed in parallel, collected to the master, and eigen-decomposed as a local matrix, with the result broadcast back to the workers). We have plans for implementations that should scale lmmreg to UKBB-scale sample sizes in the future.
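To make the bottleneck concrete, here's a rough numpy sketch of what the current implementation does with the kinship matrix once it's been collected to the master (illustrative only, not the actual Scala code):

```python
import numpy as np

def local_kinship_eigendecomposition(G):
    """Sketch of the lmmreg bottleneck: the n x n kinship matrix is
    formed and eigen-decomposed locally on the master node.

    G: standardized genotype matrix (n samples, m variants)
    """
    n, m = G.shape
    K = G @ G.T / m  # realized relationship matrix, n x n
    # eigh is O(n^3) time and K is O(n^2) memory on a single machine:
    # fine for ~10k samples, prohibitive at UKBB scale (~500k samples)
    eigvals, eigvecs = np.linalg.eigh(K)
    return K, eigvals, eigvecs
```

The per-variant association work after the decomposition parallelizes fine; it's the single-machine O(n^2) memory and O(n^3) eigendecomposition that caps the sample size.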
Hail’s linear regression with several covariates is faster than PLINK 1.9 on one core due to a trick which effectively projects out sample covariates from the genotype vector per variant via a dense-matrix / sparse-vector multiply, reducing the per-variant computation to a simple correlation. If you’re interested, the code is here.
The code is hard to follow given that I am not familiar with Python (or this might be written in a language other than Python, I don't know). The math seems like a variation on linear regression through QR decomposition? Where does the variable xx on line 65 come from, and what is its value?
The language is Scala (plus Breeze and Spark), and that's right: QR decomposition is used to find the projection matrix Q transpose from n-dimensional space (with n samples) onto the k-dimensional vector space spanned by the k sample covariate vectors. xx is shorthand for the dot product of x and x, xxp is the dot product of x and its projection xp, etc. Why this math works is not transparent, so I should really write up a mathematical description as well. I'll let you know here when I do!
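In the meantime, here's a rough numpy sketch of the idea (my illustration, not the production Scala code; the `xx` / `xxp` names mirror the shorthand above). After projecting the covariates out of both the phenotype and each genotype vector, the per-variant regression collapses to a few dot products:

```python
import numpy as np

def linreg_projected(y, X, C):
    """Per-variant linear regression with covariates projected out.

    y: phenotype vector (n,)
    X: genotype matrix (n, m), one column per variant
    C: covariate matrix (n, k), including an intercept column
    """
    n, k = C.shape
    Q, _ = np.linalg.qr(C)        # orthonormal basis for span(C)
    d = n - k - 1                 # residual degrees of freedom
    yp = Q.T @ y                  # coefficients of y's projection onto span(C)
    yyp = y @ y - yp @ yp         # squared norm of residualized y
    betas, tstats = [], []
    for x in X.T:
        xp = Q.T @ x
        xx = x @ x                # <x, x>
        xxp = xx - xp @ xp        # squared norm of residualized x
        xyp = x @ y - xp @ yp     # <residualized x, residualized y>
        b = xyp / xxp
        se = np.sqrt((yyp / xxp - b * b) / d)
        betas.append(b)
        tstats.append(b / se)
    return np.array(betas), np.array(tstats)
```

By the Frisch-Waugh-Lovell theorem, the effect size and t-statistic for each genotype agree with those from the full multiple regression on genotype plus covariates; the QR factorization of the covariates is done once, so each variant costs only a handful of (sparse-friendly) dot products.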
No, it's not apparent, but your comments give me a basis to rewrite this in R, compare it to basic OLS matrix algebra, and begin to mess around to enhance my understanding. Thanks!
Could you point me to the place in the code where you run logistic regression? Any speed-ups there would be most relevant to me.
The internals of all four logistic models are here, with some performance comparisons at the end of this post. The projection trick does not extend to logistic regression, and genotype sparsity doesn't help much since the sample covariates are dense and used alongside the genotypes in each iteration. We've also tried using QR and a triangular solve in the Newton iteration to avoid direct inversion of the Hessian (Fisher information), but found this does worse, likely because the number of covariates is tiny compared to the number of samples. If you really want to optimize single-core performance, check out the vectorization tricks used in TopCoder competitions. In the end, logistic regression is per-variant and scales beautifully with cores, so it does not seem to be a pressing computational bottleneck for Hail (though we may circle back to make it more efficient in the future).
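For reference, the core Newton/Fisher-scoring iteration looks roughly like this (a numpy sketch of the standard algorithm, not the actual Scala implementation). Note the k x k system solved each iteration is tiny compared to n, which is why fancier factorizations of it don't pay off:

```python
import numpy as np

def logistic_newton(X, y, max_iter=25, tol=1e-8):
    """Newton / Fisher-scoring fit of logistic regression.

    X: design matrix (n, k) -- genotype plus dense sample covariates
    y: binary phenotype vector (n,)
    """
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))  # fitted probabilities
        w = p * (1.0 - p)                    # IRLS weights
        score = X.T @ (y - p)                # gradient of log-likelihood
        fisher = X.T @ (w[:, None] * X)      # Fisher information, k x k
        step = np.linalg.solve(fisher, score)  # small dense solve, k << n
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Each iteration is dominated by the dense n x k products forming the score and Fisher information, which is the part that's hard to speed up when the covariates are dense; the solve itself is negligible.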