Answer
gwas = gwas.annotate(odds_ratio = hl.exp(gwas.beta))
Intuition
(thanks to @jbloom)
First, a couple definitions:
- The odds of an event is the ratio of the probability of that event occurring to the probability of that event not occurring. If the probability of an event is p, the odds of that event is p / (1 - p).
- The log odds of an event is simply the logarithm of the odds: \mathrm{log}(p / (1 - p))
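The two definitions above can be sketched in plain Python. The function names odds and log_odds are chosen here for illustration and are not part of Hail:

```python
import math

def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Natural logarithm of the odds (requires 0 < p < 1)."""
    return math.log(odds(p))

# A probability of 0.75 corresponds to odds of 3, often read as "3 to 1".
print(odds(0.75))
# A probability of 0.5 means even odds (odds of 1), so the log odds are 0.
print(log_odds(0.5))
```

Probabilities below 0.5 give odds below 1 and therefore negative log odds.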
Linear regression models the outcome, y, as a linear function of the covariates. For example, in simple linear regression, the slope (coefficient on x) is the change in the value of y when x is increased by 1.
Analogously, logistic regression, by its very definition, models the log odds of success (e.g. case status) as a linear function of the covariates. So each coefficient may be interpreted as the predicted change in the log odds when increasing that covariate by 1.
For GWAS, \beta_{GT} is then the predicted change in the log odds of a case when increasing the number of alternate alleles by 1.
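We can convert from log odds to odds by exponentiating; therefore, \mathrm{exp}(\beta_{GT}) is the factor by which the odds are multiplied when the number of alternate alleles increases by 1. A value greater than 1 means the odds increase, and a value less than 1 means they decrease. Note that, in genetics, people use the term “odds ratio” to refer to this change in the odds when increasing the alternate alleles by 1.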
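As a rough numerical illustration of this interpretation, the sketch below applies a made-up coefficient to made-up baseline odds. The values of beta_gt and baseline_odds are hypothetical, chosen only to show that each additional alternate allele multiplies the odds by the same factor under this model:

```python
import math

def odds_ratio_from_beta(beta):
    """Convert a logistic regression coefficient (a change in log odds) to an odds ratio."""
    return math.exp(beta)

def prob_from_odds(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)

beta_gt = 0.2          # hypothetical coefficient on the genotype
baseline_odds = 0.25   # hypothetical odds of being a case with 0 alternate alleles

for n_alt in (0, 1, 2):
    # Adding beta_gt to the log odds n_alt times multiplies the odds by exp(beta_gt) ** n_alt.
    o = baseline_odds * odds_ratio_from_beta(beta_gt) ** n_alt
    print(n_alt, round(o, 4), round(prob_from_odds(o), 4))
```

Note that the odds change by a constant factor per allele, while the corresponding change in probability depends on the baseline.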
Algebra
We need a few more definitions before we can fully work out the algebra:
- The odds ratio of an event given an exposure is the ratio of the odds of the event if the exposure occurred to the odds of the event if the exposure did not occur. Formulaically: \mathrm{odds}(o | e) / \mathrm{odds}(o | \mathrm{not} \space e)
- \mathrm{Prob}(e) is the probability of some event e
- a function f is called the inverse of a function g when f(g(x)) = x = g(f(x))
- \mathrm{log}(x) is the natural logarithm
- \mathrm{exp}(x) is the natural exponential function, which is also the inverse of the natural logarithm
- \mathrm{logit}(x) = \mathrm{log}(x / (1 - x)) is the logit function
- \mathrm{sigmoid}(x) is the sigmoid function, which is also the inverse of the logit function
- The beta field produced by logistic_regression_rows is the value which maximizes the probability of the data in this model (cov_i is the ith element of covariates):
\mathrm{Prob}(y | x, cov_1, \cdots cov_n)
= \mathrm{sigmoid}(\beta_{GT} * x + \beta_1 * cov_1 + \cdots + \beta_n * cov_n + \varepsilon)
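The inverse relationships in these definitions are easy to check numerically. The sketch below uses the standard closed form sigmoid(x) = 1 / (1 + exp(-x)), which the definitions above do not spell out; the test points are arbitrary:

```python
import math

def logit(p):
    """logit(p) = log(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def sigmoid(x):
    """Standard logistic function, 1 / (1 + exp(-x))."""
    return 1 / (1 + math.exp(-x))

# sigmoid undoes logit on probabilities...
for p in (0.1, 0.5, 0.9):
    assert math.isclose(sigmoid(logit(p)), p)

# ...and logit undoes sigmoid on real numbers.
for x in (-2.0, 0.0, 3.0):
    assert math.isclose(logit(sigmoid(x)), x, abs_tol=1e-12)

# Exponentiating the logit of a probability recovers the odds p / (1 - p).
p = 0.2
assert math.isclose(math.exp(logit(p)), p / (1 - p))
```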
Let’s collect the formulas:
\begin{aligned}
\mathrm{logit}(x) &= \mathrm{log}(\frac{x}{1 - x}) \\ \\
\mathrm{odds}(e) &= \frac{\mathrm{Prob}(e)}{1 - \mathrm{Prob}(e)}
= \mathrm{exp}(\mathrm{logit}(\mathrm{Prob}(e))) \\ \\
\mathrm{oddsRatio}(o | e) &= \frac{\mathrm{odds}(o|e)}{\mathrm{odds}(o| \mathrm{not} \space e)} \\ \\
\mathrm{Prob}(y | x, cov_1 \cdots cov_n)
&= \mathrm{sigmoid}(\beta_{GT} * x + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)
\end{aligned}
When we discuss odds ratio in genetics, we’re usually interested in the odds ratio of a binary phenotype, such as case/control status, given the presence of one alternate allele versus no alternate alleles. Assuming \mathrm{GT} is the number of alternate alleles at a given site, we can express the odds ratio mathematically like this:
\mathrm{oddsRatio}(\texttt{is\_case} | \mathrm{GT \space is \space one, not \space zero})
= \frac{\mathrm{odds}(\texttt{is\_case} | \mathrm{GT} = 1)}
{\mathrm{odds}(\texttt{is\_case} | \mathrm{GT} = 0)}
We can now mechanically derive an expression for the odds ratio in terms of the logistic regression coefficient, beta. We start by substituting the definition for \mathrm{odds}:
\mathrm{oddsRatio}(\cdots)
= \frac{\mathrm{exp}(\mathrm{logit}(\mathrm{Prob}(\texttt{is\_case} | \mathrm{GT} = 1)))}
{\mathrm{exp}(\mathrm{logit}(\mathrm{Prob}(\texttt{is\_case} | \mathrm{GT} = 0)))}
Substitute the definition for \mathrm{Prob}(\texttt{is\_case} | \mathrm{GT}) (this derivation works for any number of covariates, as we see below). Note that we replace x from the definition with the value of \mathrm{GT}, namely 1 and 0:
\mathrm{oddsRatio}(\cdots)
=
\frac{\mathrm{exp}(\mathrm{logit}(\mathrm{sigmoid}(\beta_{GT} * 1 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)))}
{\mathrm{exp}(\mathrm{logit}(\mathrm{sigmoid}(\beta_{GT} * 0 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)))}
Recall that logit and sigmoid are inverses, so we can remove them:
\mathrm{oddsRatio}(\cdots)
=
\frac{\mathrm{exp}(\beta_{GT} * 1 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)}
{\mathrm{exp}(\beta_{GT} * 0 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)}
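The covariates take the same values in the numerator and the denominator, so applying the law of exponents \mathrm{exp}(a) / \mathrm{exp}(b) = \mathrm{exp}(a - b) cancels every term except the one involving \beta_{GT}: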
\begin{aligned}
\mathrm{oddsRatio}(\cdots)
=
\mathrm{exp}&( \beta_{GT} * 1 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon \\
&- (\beta_{GT} * 0 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon))
\\
=
\mathrm{exp}&(\beta_{GT} * 1)
\\
=
\mathrm{exp}&(\beta_{GT})
\end{aligned}
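The result can also be checked numerically. The sketch below is a plain-Python stand-in for the model, not Hail code: the coefficient values and covariate values are made up, and the term written \varepsilon in the formulas is treated as a fixed constant eps. Whatever covariate values are plugged in, the ratio of the odds at GT = 1 to the odds at GT = 0 should match exp(beta_gt):

```python
import math

def sigmoid(x):
    """Standard logistic function, 1 / (1 + exp(-x))."""
    return 1 / (1 + math.exp(-x))

def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def prob_case(gt, covs, beta_gt, betas, eps):
    """Prob(is_case | GT = gt, covariates) under the logistic model."""
    linear = beta_gt * gt + sum(b * c for b, c in zip(betas, covs)) + eps
    return sigmoid(linear)

def odds_ratio_gt(covs, beta_gt, betas, eps):
    """odds(is_case | GT = 1) / odds(is_case | GT = 0), covariates held fixed."""
    return (odds(prob_case(1, covs, beta_gt, betas, eps))
            / odds(prob_case(0, covs, beta_gt, betas, eps)))

# Hypothetical values, for illustration only.
beta_gt = 0.3
betas = [0.05, -1.2]
eps = -2.0

for covs in ([30.0, 0.0], [65.0, 1.0]):
    ratio = odds_ratio_gt(covs, beta_gt, betas, eps)
    # The covariate terms cancel, leaving exp(beta_gt) in both cases.
    assert math.isclose(ratio, math.exp(beta_gt))
    print(covs, ratio)
```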
Further Reading
- “Odds” https://en.wikipedia.org/wiki/Odds
- “Explaining Odds Ratios” https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/
- “Logistic regression” https://en.wikipedia.org/wiki/Logistic_regression