How do I get an odds ratio from Hail's logistic regression results?

danking · October 28, 2019, 5:31pm

We ran:

import hail as hl

# mt is an existing MatrixTable with the phenotype stored in mt.pheno.Flag
gwas = hl.logistic_regression_rows(
    test='lrt',
    y=mt.pheno.Flag,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0]
)

Which produces a bunch of data fields, but not an odds ratio!

In[1]: gwas.show()
+---------------+------------+-----------+-------------+----------+------------------+---------------+--------------+
| locus         | alleles    |      beta | chi_sq_stat |  p_value | fit.n_iterations | fit.converged | fit.exploded |
+---------------+------------+-----------+-------------+----------+------------------+---------------+--------------+
| locus<GRCh37> | array<str> |   float64 |     float64 |  float64 |            int32 |          bool |         bool |
+---------------+------------+-----------+-------------+----------+------------------+---------------+--------------+
| 1:1           | ["A","C"]  |  1.50e+00 |    1.88e+00 | 1.70e-01 |                6 |          true |        false |
| 1:2           | ["A","C"]  |  5.43e-01 |    2.15e-01 | 6.43e-01 |                4 |          true |        false |
| 1:3           | ["A","C"]  | -2.23e-01 |    2.24e-02 | 8.81e-01 |                4 |          true |        false |
| 1:4           | ["A","C"]  |  8.44e-01 |    8.18e-01 | 3.66e-01 |                5 |          true |        false |
| 1:5           | ["A","C"]  | -9.81e-01 |    4.83e-01 | 4.87e-01 |                4 |          true |        false |
| 1:6           | ["A","C"]  |  2.27e-01 |    4.46e-02 | 8.33e-01 |                4 |          true |        false |
| 1:7           | ["A","C"]  | -6.93e-01 |    6.22e-01 | 4.30e-01 |                4 |          true |        false |
| 1:8           | ["A","C"]  | -2.13e-01 |    4.28e-02 | 8.36e-01 |                4 |          true |        false |
| 1:9           | ["A","C"]  |        NA |          NA |       NA |               26 |         false |        false |
| 1:10          | ["A","C"]  |        NA |          NA |       NA |               26 |         false |        false |
+---------------+------------+-----------+-------------+----------+------------------+---------------+--------------+

How do I get an odds ratio from these fields?

danking · October 28, 2019, 8:09pm

Answer

# logistic_regression_rows returns a Table, so use annotate and refer to gwas.beta
gwas = gwas.annotate(odds_ratio=hl.exp(gwas.beta))

Intuition

(thanks to @jbloom)

First, a couple definitions:

The odds of an event is the ratio of the probability of that event occurring to the probability of that event not occurring. If the probability of an event is p, the odds of that event is p / (1 - p).
The log odds of an event is simply the logarithm of the odds: \mathrm{log}(p / (1 - p))
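To make the two definitions concrete, here is a minimal sketch in plain Python (no Hail required). The helper names `odds` and `log_odds` are made up for illustration and are not part of any library.

```python
import math

def odds(p: float) -> float:
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def log_odds(p: float) -> float:
    """Natural logarithm of the odds (requires 0 < p < 1)."""
    return math.log(odds(p))

# A probability of 0.5 gives even odds (1) and a log odds of 0;
# probabilities above 0.5 give positive log odds, below 0.5 negative.
for p in (0.2, 0.5, 0.8):
    print(f"p={p}: odds={odds(p):.3f}, log odds={log_odds(p):.3f}")
```

Note the symmetry: the log odds of p and of 1 - p differ only in sign.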

Linear regression models the outcome, y, as a linear function of the covariates. For example, in simple linear regression, the slope (coefficient on x) is the change in the value of y when x is increased by 1.

Analogously, logistic regression, by its very definition, models the log odds of success (e.g. case status) as a linear function of the covariates. So each coefficient may be interpreted as the predicted change in the log odds when increasing that covariate by 1.

For GWAS, \beta_{GT} is then the predicted change in the log odds of a case when increasing the number of alternate alleles by 1.

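We can convert from log odds to odds by exponentiating; therefore, \mathrm{exp}(\beta_{GT}) is the factor by which the predicted odds are multiplied when the number of alternate alleles increases by 1. A value above 1 means the odds increase, and a value below 1 means they decrease. Note that, in genetics, people use the term “odds ratio” to refer to this multiplicative change in the odds per additional alternate allele.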
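As a rough illustration outside of Hail, the sketch below exponentiates a few of the beta values shown in the table from the question. The dictionary and the helper `odds_ratio` are hypothetical names for this example, and `None` stands in for the NA rows where the fit did not converge.

```python
import math

# beta values copied from rows of the table in the question;
# None represents NA (fit.converged was false for locus 1:9).
betas = {"1:1": 1.50, "1:2": 0.543, "1:3": -0.223, "1:9": None}

def odds_ratio(beta):
    """Return exp(beta), or None when beta is missing."""
    return None if beta is None else math.exp(beta)

odds_ratios = {locus: odds_ratio(beta) for locus, beta in betas.items()}

for locus, value in odds_ratios.items():
    print(locus, "NA" if value is None else f"{value:.2f}")
```

A positive beta gives an odds ratio above 1, a negative beta gives one below 1, and a missing beta stays missing.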

Algebra

We need a few more definitions before we can fully work out the algebra:

The odds ratio of an event given an exposure is the ratio of the odds of the event if the exposure occurred to the odds of the event if the exposure did not occur. Formulaically: \mathrm{odds}(o | e) / \mathrm{odds}(o | \mathrm{not} \space e)
\mathrm{Prob}(e) is the probability of some event e
A function f is called the inverse of a function g when f(g(x)) = x = g(f(x))
\mathrm{log}(x) is the natural logarithm
\mathrm{exp}(x) is the natural exponential function, which is also the inverse of the natural logarithm
\mathrm{logit}(x) = \mathrm{log}(x / (1 - x)) is the logit function
\mathrm{sigmoid}(x) is the sigmoid function, which is also the inverse of the logit function
The beta field produced by logistic_regression_rows is the value of \beta_{GT} that maximizes the probability of the observed data under this model (cov_i is the ith element of covariates):

\mathrm{Prob}(y | x, cov_1, \cdots cov_n) = \mathrm{sigmoid}(\beta_{GT} * x + \beta_1 * cov_1 + \cdots + \beta_n * cov_n + \varepsilon)

Let’s collect the formulas:

\begin{aligned} \mathrm{logit}(x) &= \mathrm{log}(\frac{x}{1 - x}) \\ \\ \mathrm{odds}(e) &= \frac{\mathrm{Prob}(e)}{1 - \mathrm{Prob}(e)} = \mathrm{exp}(\mathrm{logit}(\mathrm{Prob}(e))) \\ \\ \mathrm{oddsRatio}(o | e) &= \frac{\mathrm{odds}(o|e)}{\mathrm{odds}(o| \mathrm{not} \space e)} \\ \\ \mathrm{Prob}(y | x, cov_1, \cdots cov_n) &= \mathrm{sigmoid}(\beta_{GT} * x + \beta_1 * cov_1 + \cdots + \beta_n * cov_n + \varepsilon) \end{aligned}
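The identities collected above can be spot-checked numerically. This is a small sketch in plain Python, assuming the standard definition \mathrm{sigmoid}(x) = 1 / (1 + \mathrm{exp}(-x)), which the post does not spell out; the function names are chosen for this example only.

```python
import math

def logit(p: float) -> float:
    """log(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    """Standard logistic function, assumed here as 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def odds(p: float) -> float:
    """p / (1 - p)."""
    return p / (1.0 - p)

# logit and sigmoid undo each other, in both orders.
for x in (-2.0, 0.0, 0.7):
    assert math.isclose(logit(sigmoid(x)), x, abs_tol=1e-12)

for p in (0.1, 0.5, 0.9):
    assert math.isclose(sigmoid(logit(p)), p, abs_tol=1e-12)
    # odds(e) = exp(logit(Prob(e)))
    assert math.isclose(odds(p), math.exp(logit(p)), rel_tol=1e-12)
```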

When we discuss odds ratio in genetics, we’re usually interested in the odds ratio of a binary phenotype, such as case/control status, given the presence of one alternate allele versus no alternate alleles. Assuming \mathrm{GT} is the number of alternate alleles at a given site, we can express the odds ratio mathematically like this:

\mathrm{oddsRatio}(\texttt{is\_case} | \mathrm{GT \space is \space one, not \space zero}) = \frac{\mathrm{odds}(\texttt{is\_case} | \mathrm{GT} = 1)} {\mathrm{odds}(\texttt{is\_case} | \mathrm{GT} = 0)}
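For a feel of what this ratio looks like with numbers, here is a minimal sketch in plain Python. The two probabilities are invented for illustration (they do not come from the data in this thread), and `odds_ratio` is a hypothetical helper.

```python
def odds(p: float) -> float:
    """p / (1 - p)."""
    return p / (1.0 - p)

def odds_ratio(p_exposed: float, p_unexposed: float) -> float:
    """Ratio of the odds with the exposure to the odds without it."""
    return odds(p_exposed) / odds(p_unexposed)

# Hypothetical values: Prob(is_case | GT = 1) and Prob(is_case | GT = 0).
p_case_gt1 = 0.30
p_case_gt0 = 0.20

ratio = odds_ratio(p_case_gt1, p_case_gt0)
print(f"odds ratio = {ratio:.3f}")
```

The odds ratio (about 1.71 here) is not the same as the ratio of the probabilities (1.5 here), which is why the derivation works with odds throughout.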

We can now mechanically derive the definition of odds ratio in terms of the logistic regression coefficient, beta. We start by substituting the definition for \mathrm{odds}:

\mathrm{oddsRatio}(\cdots) = \frac{\mathrm{exp}(\mathrm{logit}(\mathrm{Prob}(\texttt{is\_case} | \mathrm{GT} = 1)))} {\mathrm{exp}(\mathrm{logit}(\mathrm{Prob}(\texttt{is\_case} | \mathrm{GT} = 0)))}

Substitute the definition for \mathrm{Prob}(\texttt{is\_case} | \mathrm{GT}) (this derivation works for any number of covariates, as we see below). Note that we replace x from the definition with the value of \mathrm{GT}, namely 1 and 0:

\mathrm{oddsRatio}(\cdots) = \frac{\mathrm{exp}(\mathrm{logit}(\mathrm{sigmoid}(\beta_{GT} * 1 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)))} {\mathrm{exp}(\mathrm{logit}(\mathrm{sigmoid}(\beta_{GT} * 0 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)))}

Recall that logit and sigmoid are inverses, so we can remove them:

\mathrm{oddsRatio}(\cdots) = \frac{\mathrm{exp}(\beta_{GT} * 1 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)} {\mathrm{exp}(\beta_{GT} * 0 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)}

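We can apply the quotient rule for exponents, \mathrm{exp}(a) / \mathrm{exp}(b) = \mathrm{exp}(a - b), and cancel the covariate terms, which appear in both the numerator and the denominator: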

\begin{aligned} \mathrm{oddsRatio}(\cdots) = \mathrm{exp}&( \beta_{GT} * 1 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon \\ &- (\beta_{GT} * 0 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)) \\ = \mathrm{exp}&(\beta_{GT} * 1) \\ = \mathrm{exp}&(\beta_{GT}) \end{aligned}
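The result can also be checked numerically. The sketch below, in plain Python, builds the model probability with made-up coefficients and covariate values (all hypothetical, and with the \varepsilon term left out for simplicity) and confirms that the odds ratio between GT = 1 and GT = 0 equals \mathrm{exp}(\beta_{GT}) regardless of the covariates.

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def odds(p: float) -> float:
    """p / (1 - p)."""
    return p / (1.0 - p)

def prob_case(gt, covs, beta_gt, betas):
    """Model probability of being a case for a given genotype and covariates."""
    linear = beta_gt * gt + sum(b * c for b, c in zip(betas, covs))
    return sigmoid(linear)

beta_gt = 0.543        # hypothetical genotype coefficient
betas = [-1.2, 0.05]   # hypothetical coefficients: intercept and one covariate

# The covariate terms cancel, so the ratio is the same for every covariate vector.
for covs in ([1.0, 30.0], [1.0, 65.0]):
    ratio_0_to_1 = (odds(prob_case(1, covs, beta_gt, betas))
                    / odds(prob_case(0, covs, beta_gt, betas)))
    ratio_1_to_2 = (odds(prob_case(2, covs, beta_gt, betas))
                    / odds(prob_case(1, covs, beta_gt, betas)))
    assert math.isclose(ratio_0_to_1, math.exp(beta_gt), rel_tol=1e-9)
    assert math.isclose(ratio_1_to_2, math.exp(beta_gt), rel_tol=1e-9)
```

The same factor applies when going from one alternate allele to two, since the model is linear in the number of alternate alleles on the log odds scale.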
