How do I get an odds ratio from Hail's logistic regression results?

We ran:

gwas = hl.logistic_regression_rows(
    test = 'lrt',
    y = mt.pheno.Flag,
    x = mt.GT.n_alt_alleles(),
    covariates=[1.0]
)

Which produces a bunch of data fields, but not an odds ratio!

In[1]: gwas.show()
+---------------+------------+-----------+-------------+----------+------------------+---------------+--------------+
| locus         | alleles    |      beta | chi_sq_stat |  p_value | fit.n_iterations | fit.converged | fit.exploded |
+---------------+------------+-----------+-------------+----------+------------------+---------------+--------------+
| locus<GRCh37> | array<str> |   float64 |     float64 |  float64 |            int32 |          bool |         bool |
+---------------+------------+-----------+-------------+----------+------------------+---------------+--------------+
| 1:1           | ["A","C"]  |  1.50e+00 |    1.88e+00 | 1.70e-01 |                6 |          true |        false |
| 1:2           | ["A","C"]  |  5.43e-01 |    2.15e-01 | 6.43e-01 |                4 |          true |        false |
| 1:3           | ["A","C"]  | -2.23e-01 |    2.24e-02 | 8.81e-01 |                4 |          true |        false |
| 1:4           | ["A","C"]  |  8.44e-01 |    8.18e-01 | 3.66e-01 |                5 |          true |        false |
| 1:5           | ["A","C"]  | -9.81e-01 |    4.83e-01 | 4.87e-01 |                4 |          true |        false |
| 1:6           | ["A","C"]  |  2.27e-01 |    4.46e-02 | 8.33e-01 |                4 |          true |        false |
| 1:7           | ["A","C"]  | -6.93e-01 |    6.22e-01 | 4.30e-01 |                4 |          true |        false |
| 1:8           | ["A","C"]  | -2.13e-01 |    4.28e-02 | 8.36e-01 |                4 |          true |        false |
| 1:9           | ["A","C"]  |        NA |          NA |       NA |               26 |         false |        false |
| 1:10          | ["A","C"]  |        NA |          NA |       NA |               26 |         false |        false |
+---------------+------------+-----------+-------------+----------+------------------+---------------+--------------+

How do I get an odds ratio from these fields?


Answer

# logistic_regression_rows returns a Table, so use annotate and refer to the table's own beta field
gwas = gwas.annotate(odds_ratio = hl.exp(gwas.beta))

Intuition

(thanks to @jbloom)

First, a couple definitions:

  • The odds of an event is the ratio of the probability of that event occurring to the probability of that event not occurring. If the probability of an event is p, the odds of that event is p / (1 - p).
  • The log odds of an event is simply the logarithm of the odds: \mathrm{log}(p / (1 - p))
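
As a quick illustration of these two definitions, here is a minimal Python sketch. The helper names odds and log_odds are made up for this example and are not part of Hail.

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p). Requires p < 1."""
    return p / (1 - p)

def log_odds(p):
    """Natural logarithm of the odds. Requires 0 < p < 1."""
    return math.log(odds(p))

print(odds(0.75))     # a 75% probability corresponds to odds of 3 (i.e. 3 to 1)
print(log_odds(0.5))  # even odds (1 to 1) have log odds of 0
```

A probability above 0.5 gives odds above 1 and positive log odds; a probability below 0.5 gives odds below 1 and negative log odds.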

Linear regression models the outcome, y, as a linear function of the covariates. For example, in simple linear regression, the slope (coefficient on x) is the change in the value of y when x is increased by 1.

Analogously, logistic regression, by its very definition, models the log odds of success (e.g. case status) as a linear function of the covariates. So each coefficient may be interpreted as the predicted change in the log odds when increasing that covariate by 1.

For GWAS, \beta_{GT} is then the predicted change in the log odds of a case when increasing the number of alternate alleles by 1.

We can convert from log odds to odds by exponentiating; therefore, \mathrm{exp}(\beta_{GT}) is the factor by which the odds are multiplied (greater than 1 for an increase, less than 1 for a decrease) when the number of alternate alleles increases by 1. Note that, in genetics, people use the term “odds ratio” to refer to this multiplicative change in the odds per additional alternate allele.
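
To make this concrete, here is a small plain-Python sketch (no Hail required) that exponentiates two of the beta values from the table in the question. The values are the rounded numbers as displayed by gwas.show(), so the results are approximate.

```python
import math

# Beta values as displayed (rounded) in the question's table for loci 1:1 and 1:7.
betas = {"1:1": 1.50, "1:7": -0.693}

# exp(beta) is the multiplicative change in the odds per additional alternate allele.
odds_ratios = {locus: math.exp(beta) for locus, beta in betas.items()}

for locus, odds_ratio in odds_ratios.items():
    print(locus, round(odds_ratio, 2))
```

For 1:1 the odds ratio is roughly 4.48, meaning each extra alternate allele multiplies the estimated odds of being a case by about 4.5. For 1:7 it is roughly 0.5, meaning the estimated odds are halved. Neither of these is statistically significant according to the p-values in the table.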

Algebra

We need a few more definitions before we can fully work out the algebra:

  • The odds ratio of an event given an exposure is the ratio of the odds of the event if the exposure occurred to the odds of the event if the exposure did not occur. Formulaically: \mathrm{odds}(o | e) / \mathrm{odds}(o | \mathrm{not} \space e)
  • \mathrm{Prob}(e) is the probability of some event e
  • A function f is called the inverse of a function g when f(g(x)) = x = g(f(x))
  • \mathrm{log}(x) is the natural logarithm
  • \mathrm{exp}(x) is the natural exponential function, which is also the inverse of the natural logarithm
  • \mathrm{logit}(x) = \mathrm{log}(x / (1 - x)) is the logit function
  • \mathrm{sigmoid}(x) is the sigmoid function, which is also the inverse of the logit function
  • The beta field produced by logistic_regression_rows is the value which maximizes the probability of the data in this model (cov_i is the ith element of covariates):
\mathrm{Prob}(y | x, cov_1, \cdots cov_n) = \mathrm{sigmoid}(\beta_{GT} * x + \beta_1 * cov_1 + \cdots + \beta_n * cov_n + \varepsilon)

Let’s collect the formulas:

\begin{aligned} \mathrm{logit}(x) &= \mathrm{log}(\frac{x}{1 - x}) \\ \\ \mathrm{odds(e)} &= \frac{\mathrm{Prob}(e)}{1 - \mathrm{Prob}(e)} = \mathrm{exp}(\mathrm{logit}(\mathrm{Prob}(e))) \\ \\ \mathrm{oddsRatio}(o | e) &= \frac{\mathrm{odds}(o|e)}{\mathrm{odds}(o| \mathrm{not} \space e)} \\ \\ \mathrm{Prob}(y | x, cov_1 \cdots cov_n) &= \mathrm{sigmoid}(\beta_{GT} * x + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon) \end{aligned}

When we discuss odds ratio in genetics, we’re usually interested in the odds ratio of a binary phenotype, such as case/control status, given the presence of one alternate allele versus no alternate alleles. Assuming \mathrm{GT} is the number of alternate alleles at a given site, we can express the odds ratio mathematically like this:

\mathrm{oddsRatio}(\texttt{is\_case} | \mathrm{GT \space is \space one, not \space zero}) = \frac{\mathrm{odds}(\texttt{is\_case} | \mathrm{GT} = 1)} {\mathrm{odds}(\texttt{is\_case} | \mathrm{GT} = 0)}

We can now mechanically derive the definition of odds ratio in terms of the logistic regression coefficient, beta. We start by substituting the definition for \mathrm{odds}:

\mathrm{oddsRatio}(\cdots) = \frac{\mathrm{exp}(\mathrm{logit}(\mathrm{Prob}(\texttt{is\_case} | \mathrm{GT} = 1)))} {\mathrm{exp}(\mathrm{logit}(\mathrm{Prob}(\texttt{is\_case} | \mathrm{GT} = 0)))}

Substitute the definition for \mathrm{Prob}(\texttt{is\_case} | \mathrm{GT}) (this derivation works for any number of covariates, as we see below). Note that we replace x from the definition with the value of \mathrm{GT}, namely 1 and 0:

\mathrm{oddsRatio}(\cdots) = \frac{\mathrm{exp}(\mathrm{logit}(\mathrm{sigmoid}(\beta_{GT} * 1 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)))} {\mathrm{exp}(\mathrm{logit}(\mathrm{sigmoid}(\beta_{GT} * 0 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)))}

Recall that logit and sigmoid are inverses, so we can remove them:

\mathrm{oddsRatio}(\cdots) = \frac{\mathrm{exp}(\beta_{GT} * 1 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)} {\mathrm{exp}(\beta_{GT} * 0 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)}

We can apply the law of exponents, \mathrm{exp}(a) / \mathrm{exp}(b) = \mathrm{exp}(a - b), to simplify; every term except \beta_{GT} * 1 cancels:

\begin{aligned} \mathrm{oddsRatio}(\cdots) = \mathrm{exp}&( \beta_{GT} * 1 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon \\ &- (\beta_{GT} * 0 + \beta_1 cov_1 + \cdots + \beta_n cov_n + \varepsilon)) \\ = \mathrm{exp}&(\beta_{GT} * 1) \\ = \mathrm{exp}&(\beta_{GT}) \end{aligned}
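
The result can also be checked numerically. The sketch below is plain Python with hypothetical helper names; the variable rest stands for the sum of all non-genotype terms in the formula above, and its values are arbitrary. It computes the odds ratio directly from the two probabilities and compares it with \mathrm{exp}(\beta_{GT}).

```python
import math

def sigmoid(z):
    """Inverse of the logit function."""
    return 1 / (1 + math.exp(-z))

def odds(p):
    return p / (1 - p)

def odds_ratio(beta_gt, rest):
    """odds(is_case | GT = 1) / odds(is_case | GT = 0).

    `rest` is the sum of every non-genotype term in the linear predictor.
    """
    p1 = sigmoid(beta_gt * 1 + rest)
    p0 = sigmoid(beta_gt * 0 + rest)
    return odds(p1) / odds(p0)

beta_gt = 0.844  # rounded beta displayed for locus 1:4 in the question
for rest in (-2.0, 0.0, 1.3):  # arbitrary values for the other terms
    # The ratio does not depend on `rest`: it always equals exp(beta_gt).
    assert math.isclose(odds_ratio(beta_gt, rest), math.exp(beta_gt), rel_tol=1e-9)
```

The assertions pass for every value of rest, which mirrors the cancellation in the last step of the derivation.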

Further Reading

  1. “Odds” https://en.wikipedia.org/wiki/Odds
  2. “Explaining Odds Ratios” https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/
  3. “Logistic regression” https://en.wikipedia.org/wiki/Logistic_regression

Thanks @danking for your post,

Is it also correct to add a confidence interval (e.g. 95%) as follows (with test='wald')?

# Compute Odds ratio and 95% confidence interval from logistic regression results
tb_stats = tb_stats.annotate(odds_ratio=hl.exp(tb_stats.beta),
                             lower_ci_95=hl.exp(tb_stats.beta - 1.96 * tb_stats.standard_error),
                             upper_ci_95=hl.exp(tb_stats.beta + 1.96 * tb_stats.standard_error))

Best,

E.

Yes this is correct. A more detailed explanation of this can be found in this NCSS document.
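
The reasoning is that the Wald interval is built on the log odds scale as beta ± 1.96 * standard_error, and both endpoints are then exponentiated to move to the odds ratio scale. Here is a minimal plain-Python sketch of the same arithmetic; the function name odds_ratio_ci and the input values are hypothetical and do not come from the tables above.

```python
import math

def odds_ratio_ci(beta, standard_error, z=1.96):
    """Return (odds_ratio, lower, upper).

    z = 1.96 gives an approximate 95% interval, assuming the estimate of
    beta is approximately normally distributed.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * standard_error),
        math.exp(beta + z * standard_error),
    )

# Hypothetical inputs for illustration only.
or_, lower, upper = odds_ratio_ci(beta=0.5, standard_error=0.2)
print(round(or_, 2), round(lower, 2), round(upper, 2))
```

Note that the interval is symmetric around beta on the log odds scale but not around the odds ratio itself, and that an interval containing 1 corresponds to a beta interval containing 0.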