Phenotype prediction


Hey all,

Unless I’m overlooking it, I don’t currently see methods that emphasize prediction of phenotypes over computing variant associations. For example, Alexander Gusev’s FUSION package for running transcriptome-wide association studies uses this machinery and includes several well-established strategies such as BLUP and elastic nets. Do you plan to include this class of methods in the future?



Hi Dave, you’re not overlooking it! Thanks for the pointer. It looks like these methods start from GWAS summary statistics and then leverage R packages for regularized regression. We don’t have near-term plans to incorporate them ourselves (in part because scale is less of a pressing concern with summary stats), but we are working on two things that should make such methods easier to implement: making big linear algebra more flexible, exposed, and performant, and making VDS generic enough to handle other data types, such as large numbers of functional phenotypes treated as a matrix rather than a table. A simple thing you can do already is linearly predict risk from betas obtained internally or externally by annotating samples with an expression like this:

sa.polyRisk = => * va.AF) * va.beta).sum()
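To make the arithmetic concrete, here is a minimal plain-Python sketch of the same kind of linear score, outside of Hail. It assumes each genotype is an alternate-allele count (0, 1, or 2) that is mean-centered by subtracting twice the allele frequency before being weighted by beta. That centering is an assumption on my part, and the function and variable names are hypothetical, not Hail API.

```python
def polygenic_risk(genotypes, allele_freqs, betas):
    """Linear risk score per sample.

    genotypes:    one list per sample of alternate-allele counts (0, 1, 2),
                  with None for a missing call
    allele_freqs: per-variant alternate allele frequency
    betas:        per-variant effect size, same order as allele_freqs
    """
    scores = []
    for sample in genotypes:
        total = 0.0
        for g, af, beta in zip(sample, allele_freqs, betas):
            if g is None:
                # Skipping a missing call is the same as imputing it
                # to the expected count 2 * af (centered value 0).
                continue
            total += (g - 2 * af) * beta
        scores.append(total)
    return scores


# Toy data: 2 samples, 3 variants (values are made up for illustration).
genotypes = [[0, 1, 2], [2, None, 0]]
allele_freqs = [0.5, 0.25, 0.1]
betas = [0.2, -0.4, 1.0]

scores = polygenic_risk(genotypes, allele_freqs, betas)
print(scores)
```

The betas can come from an association run on the same data or from an external study, as long as they refer to the same allele as the genotype counts.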

Another approach may be to munge your data in Hail and then leverage the ML functionality in PySpark, see for example:


Interesting, I’m not familiar with how to use summary statistics to do this. I was thinking of the scenario where you’re working with the genotypes, and you want to predict a phenotype. Since you typically have far more variants than samples, you need to regularize aggressively, so something like an elastic net is a reasonable approach. Maybe a more relevant link would be PrediXcan:
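For readers who have not seen it, a rough sketch of what elastic net regression on genotypes looks like, written as plain-Python coordinate descent so it has no dependencies. This is an illustration only, not how FUSION, PrediXcan, or Hail implement anything. It assumes the genotype columns and the phenotype are already mean-centered (so no intercept is fit), and the names `elastic_net`, `lam`, and `alpha` are hypothetical. The objective minimized is (1/2n)·||y − Xb||² + lam·(alpha·||b||₁ + (1 − alpha)/2·||b||²).

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; the L1 part of the penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for elastic net regression.

    X: list of samples, each a list of centered genotype values
    y: centered phenotype, one value per sample
    lam: overall penalty strength; alpha: L1 share of the penalty (0..1)
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # equals y - X*beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic variant: coefficient stays 0
            # Covariance of variant j with the residual that excludes
            # variant j's own current contribution.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha)
            new /= col_sq[j] + lam * (1.0 - alpha)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def predict(X, beta):
    """Predicted (centered) phenotype for each sample."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


# Toy data with more variants (5) than samples (3); values are made up.
X = [[-1, 0, 1, 0, 1],
     [0, 1, -1, 0, 0],
     [1, -1, 0, 0, -1]]
y = [1.0, 0.0, -1.0]

beta = elastic_net(X, y, lam=0.1, alpha=0.5)
print(beta)
print(predict(X, beta))
```

With more variants than samples the unpenalized least-squares fit is not unique, which is why the penalty matters: the L1 term zeroes out some coefficients and the L2 term keeps the remaining ones stable. In practice `lam` and `alpha` would be chosen by cross-validation, and an established library implementation would be preferable to this loop.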