Phenotype prediction

jbloom · October 5, 2017, 5:27pm

Hi Dave, you’re not overlooking it! Thanks for the pointer. It looks like these methods start from GWAS summary statistics and then leverage R packages for regularized regression. We don’t have near-term plans to incorporate them ourselves (in part because scale is less urgent with summary stats), but we are working to make big linear algebra more flexible, exposed, and performant so that such methods are easier to implement, and to make VDS more generic so it can handle other data types, such as large numbers of functional phenotypes treated as a matrix rather than a table. A simple thing you can do already is linearly predict risk from betas obtained internally or externally by annotating samples with an expression like this:

sa.polyRisk = gs.map(g => g.gt.toDouble.orElse(2 * va.AF) * va.beta).sum()
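To make the arithmetic in that expression concrete, here is a minimal plain-Python sketch of the same sum for a single sample. It is not Hail code. It assumes that g.gt is the count of alternate alleles (0, 1, or 2), that va.AF is the alternate allele frequency, and that a missing genotype is replaced by its expected count 2 * AF. The function name poly_risk and the toy data are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def poly_risk(
    genotypes: Sequence[Optional[int]],
    allele_freqs: Sequence[float],
    betas: Sequence[float],
) -> float:
    """Linear risk score for one sample: the sum over variants of dosage * beta.

    A missing genotype (None) is mean-imputed as 2 * AF, the expected
    alternate-allele count, mirroring the orElse(2 * va.AF) fallback.
    """
    if not (len(genotypes) == len(allele_freqs) == len(betas)):
        raise ValueError("genotypes, allele_freqs and betas must have equal length")

    score = 0.0
    for gt, af, beta in zip(genotypes, allele_freqs, betas):
        # Use the observed allele count when present, otherwise the expected count.
        dosage = float(gt) if gt is not None else 2.0 * af
        score += dosage * beta
    return score


# Hypothetical toy data: three variants, two samples, one missing call.
allele_freqs: List[float] = [0.10, 0.25, 0.50]
betas: List[float] = [0.20, -0.10, 0.05]
samples: Dict[str, List[Optional[int]]] = {
    "s1": [0, 1, 2],
    "s2": [1, None, 0],
}

scores = {name: poly_risk(gts, allele_freqs, betas) for name, gts in samples.items()}
print(scores)
```

Imputing with 2 * AF keeps samples with missing calls comparable to fully genotyped ones: each missing variant contributes the population-average amount rather than zero.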

Another approach may be to munge your data in Hail and then leverage the ML functionality in PySpark.
