Ordinal Regression

Hail should support ordinal regression. Ordinal regression deals with a dependent variable whose values are finite, discrete and ordered. This differs from a categorical variable where each value can be represented by a binary variable. As I understand it, ordinal regression of a dependent variable y is phrased in two steps: mapping from the independent variable x to a threshold-space, then mapping from threshold-space to the dependent variable y. For example:

\theta = x \cdot w + \varepsilon \\ y = \begin{cases} y_1 & \theta < \theta_1 \\ y_2 & \theta_1 \leq \theta < \theta_2 \\ y_3 & \theta_2 \leq \theta \end{cases}

The choices of \theta_i and w are fit according to some loss function. I do not fully understand the power of the intermediate threshold space. At the very least, it involves projecting the x vector onto a line which is then partitioned.
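
For concreteness, here is a minimal numpy sketch of that second step (the w and thresholds below are made up, and the noise term \varepsilon is ignored):

w = np.array([0.5, -1.0, 2.0])      # projects x onto the "threshold space"
thresholds = np.array([-1.0, 1.5])  # theta_1 < theta_2, so k = 3 labels

def predict_label(x, w, thresholds):
    theta = x @ w
    # Count how many thresholds lie at or below theta; that count is the
    # (0-based) index of the predicted label y_1, y_2, y_3.
    return np.searchsorted(thresholds, theta, side='right')

import numpy as np
print(predict_label(np.array([0.2, 0.1, 0.4]), w, thresholds))  # -> 1, i.e. y_2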


Apologies for resurrecting the topic without being able to contribute much, but I’d just like to note my current personal interest in being able to use ordinal regression in Hail. Currently I’m using linear regression or logistic regression on data that should be analyzed with ordinal regression, only because I’m too lazy to leave the comfort of Hail.

Here’s my understanding of possibly the simplest form of (linear) ordinal regression. I’ll try to stick with notation and conventions from Wikipedia (including one-based indexing, and using \theta - wx rather than wx - \theta), and (without really losing much generality) I’ll choose sigmoid as the link function. Perhaps it will resonate with someone and help keep the thread going. Please let me know if there’s something wrong or dumb below :slight_smile: . Also, if anyone knows of a good textbook-like reference that derives some theoretical properties of such ordinal regression models, let me know as well!

Say an (x, y) pair is a single n-dimensional observation x together with its correct label y out of k possible ordinal labels.
The model is given by n coefficients w and k-1 ordered thresholds \theta_i.
More specifically, we have

x, w \in \R^n \\ y \in \{1,2,\ldots,k\} \\ \theta_i \in \R\,,\; i=1,2,\ldots,k-1 \,, \\ \forall_{i=1}^{k-2}\; \theta_i < \theta_{i+1} \\ \theta_0 = -\infty\,,\; \theta_k = \infty

and w and \theta_i, i=1,2,\ldots,k-1 are to be optimized when fitting the model.
(The \epsilon in the original post is spurious, I think, if none of the thresholds is fixed.)

To that end one can consider the following “cumulative probabilities”:

P(y \le i| x) \equiv \sigma(\theta_i - wx) \equiv P_i(x)\,,\;i=0,1,\ldots,k \\ P_0 = 0 \,,\; P_k = 1

and their differences:

P(y=i|x) = P(y \le i | x) - P(y\le i-1|x) = P_i - P_{i-1} \equiv p_i(x)\,,\;i=1,\ldots,k.

The label y^\star(x) that the model predicts, given the observation x, could be defined by one of (at least) two distinct options, of which the first one (already described in the original post) seems more correct to me:
a) y^\star(x) is given by i such that \theta_{i-1} < wx \le \theta_i, which is also the smallest i such that P_i \ge \frac{1}{2}.
b) y^\star(x) is given by i such that p_i is the maximum of \{p_i\,,\;i=1,\ldots,k\}.
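
To make (a) and (b) concrete, here is a small numpy sketch (with a made-up model for k=4) computing the cumulative probabilities P_i, the per-label probabilities p_i, and both prediction rules:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up model: n = 2 features, k = 4 labels, hence k - 1 = 3 ordered thresholds.
w = np.array([1.0, -0.5])
theta = np.array([-1.0, 0.5, 2.0])

x = np.array([0.3, 0.8])
wx = w @ x

# Cumulative probabilities P_0, ..., P_k, padded with P_0 = 0 and P_k = 1.
P = np.concatenate([[0.0], sigmoid(theta - wx), [1.0]])
# Per-label probabilities p_i = P_i - P_{i-1}, i = 1, ..., k.
p = np.diff(P)

y_a = np.argmax(P[1:] >= 0.5) + 1  # (a) smallest i with P_i >= 1/2
y_b = np.argmax(p) + 1             # (b) i maximizing p_i
print(p.round(3), y_a, y_b)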

Here’s an illustration for k=4:

Two fairly intuitive choices for the loss function, discussed by Rennie & Srebro and implemented in the mord package, are (this is still for a single observation):

A) “IT”, “Immediate-Threshold”:

\log L^{IT}(x, y) = -\left[ \log(1 - P_{y-1}) + \log P_y \right] =\\ -\sum_{i:\, s_i(y) \neq 0} \log \sigma(s_i(y)(\theta_i - wx)) \,,\; \text{where}\; s_i(y) = \left\{\begin{array}{ll} 0, & i < y - 1 \\ -1, & i = y - 1 \\ 1, & i = y \\ 0, & i > y \end{array} \right. \;

B) “AT”, “All-Threshold”:

\log L^{AT}(x, y) = -\left[\sum_{i=1}^{y-1} \log (1 - P_i) + \sum_{i=y}^{k-1} \log P_i(x) \right] =\\ -\sum_{i=1}^{k-1} \log \sigma(s_i(y)(\theta_i - wx)) \,,\; \text{where}\; s_i(y) = \left\{\begin{array}{ll} -1, & i < y \\ 1, & i \ge y \end{array} \right.

In both cases the loss is a sum of equally weighted negative-log-likelihood (i.e. binary cross-entropy) losses associated with binary classifiers tied to particular thresholds \theta_i. In A only the two thresholds immediately around y are used, whereas in B all thresholds are used. Both approaches reduce to binary logistic regression when k=2 (I hope).
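
For what it’s worth, here is a rough numpy transcription of the two per-observation losses (one-based y as above); this is just a literal reading of the formulas, not mord’s implementation:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_IT(x, y, w, theta):
    """Immediate-Threshold: only the thresholds adjacent to the (1-based) label y contribute."""
    k = len(theta) + 1
    wx = w @ x
    loss = 0.0
    if y > 1:  # threshold just below y; note 1 - P_{y-1} = sigmoid(-(theta_{y-1} - wx))
        loss -= np.log(sigmoid(-(theta[y - 2] - wx)))
    if y < k:  # threshold just above y, i.e. P_y = sigmoid(theta_y - wx)
        loss -= np.log(sigmoid(theta[y - 1] - wx))
    return loss

def loss_AT(x, y, w, theta):
    """All-Threshold: every threshold contributes, with sign -1 below y and +1 at or above y."""
    wx = w @ x
    i = np.arange(1, len(theta) + 1)
    s = np.where(i < y, -1.0, 1.0)
    return -np.sum(np.log(sigmoid(s * (theta - wx))))

# With k = 2 (a single threshold) both losses coincide with binary logistic regression.
w, theta, x = np.array([0.7, -0.2]), np.array([0.1]), np.array([1.0, 2.0])
print(loss_IT(x, 1, w, theta), loss_AT(x, 1, w, theta))  # identical values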

To infer w and \theta's, one would of course need more than one (x, y) observation. The loss function for a collection of observations is simply the sum of losses for individual observations.

My limited personal experience is that the “AT” loss gives me better models than “IT” (in the sense that they achieve better metrics on a test set etc.). I’ve also heard anecdotes by other people to the same effect. So, my vote would be to focus on “AT” as the default option.

The implementation in mord adds L2 regularization penalizing w. In Hail it would probably be preferred not to regularize. On the other hand, there is no attempt at calculating p-values for w in mord. Another python implementation of a somewhat similar approach to the one in mord that I’ve stumbled on is: GitHub - Shopify/bevel: Ordinal regression in Python. Here, best I can tell, they optimize neither the “IT” nor the “AT” loss, but rather -\log p_y, i.e. the log-likelihood that is suggested by Wikipedia and (confusingly) the blog post by mord's author. (Consequently, their model outputs labels y^\star from (b) rather than (a) described above.) But they do compute p-values for w. They use diagonal elements of the inverted (numerically computed) Hessian as variances for the weights. Given a variance std_j^2, the p-value for w_j comes from H_0 under which w_j/std_j is standard-normally distributed. I’m not 100% sure that’s justified, but it seems very similar to what’s often done in the case of linear and logistic regression. Also, I’m not sure if the Hessian needs to be computed numerically (rather than by an analytical closed-form expression).
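
For reference, the Wald-style p-value computation described above amounts to something like the following sketch (where nll_hessian is a hypothetical function returning the Hessian of the negative log-likelihood at the fitted parameters; this is not bevel's actual API):

import numpy as np
from scipy import stats

def wald_p_values(params, nll_hessian):
    """Two-sided p-values for H_0: param_j = 0, treating the inverse Hessian of the
    negative log-likelihood at the optimum as the asymptotic covariance matrix."""
    cov = np.linalg.inv(nll_hessian(params))  # asymptotic covariance of the estimates
    std = np.sqrt(np.diag(cov))               # standard errors
    z = params / std                          # Wald statistics
    return 2 * stats.norm.sf(np.abs(z))       # two-sided normal tail probabilities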

What do you think? Is there perhaps a chance that existing Hail infrastructure for logistic regression could be used to implement such an ordinal model?

Hey Pawel,

Thanks for resurrecting. After briefly skimming over the math, I think this is probably possible to write with Hail’s current ndarray infrastructure.

I very recently wrote a python version of hail’s logistic regression method. It’s in hail currently as _logistic_regression_rows_nd. It’s slower than the original logistic_regression_rows method, and it doesn’t implement all the modes that the original method does. My intention is to keep pushing on that until it can completely replace the original method, though I don’t yet have a sense of the timeline for that. You can find the method here:

If you were so inclined, I think you could build ordinal regression by doing something similar. It’s possible there’s a method or two on ndarrays you’d need, but I’d be happy to add them if you ran into that.

I don’t know what the performance characteristics would be (currently my ndarray version of logreg is slower than the original), but we are continuing to work on ndarray performance, so whatever it is, it should at least improve over time.

Also happy to answer any questions about the linked code


Thanks! This definitely sounds constructive. I will try to have an honest look and come back with questions.

Hi!
I haven’t yet made any progress with the ordinal regression, but I haven’t forgotten about it.
Just as a small sanity check and an unambitious proof to myself that I understand something about the code that @johnc1231 has linked to, I’ll try to summarize what I think is happening there with the binary logistic regression. I hope it’s okay that I’ll use this space as my scratchpad and the proverbial rubber duck. Hopefully, writing this down will make it easier - for me or anyone else - to generalize it to the ordinal regression.

So, basically one looks for the minimum of the negative-log-likelihood function by looking for the root of its first derivative. The latter is done with Newton’s method, much as described in the documentation of hl.experimental.loop (Hail | Experimental), only in a multi-variable setting.

Some notation for data and the model:

i = 1,\ldots, n \quad \text{(observations)} \\ j = 0, 1,\ldots, m-1 \quad\text{(features, including the intercept)} \\ X = \left[ X_{ij} \right] \;,\quad X_{ij} \in \mathbb{R} \;,\\ \forall_i X_{i0} = 1 \quad \text{(the intercept)} \\ \beta = \left[ \beta_j \right] \\ y = \left[ y_i \right] \;,\quad y_i \in \{0, 1\} \;,\\ s = \left[ s_i \right] \;,\quad s_i = 2 y_i - 1 = \left\{\begin{array}{ll} -1, & y_i = 0 \\ 1, & y_i = 1 \end{array} \right. \\

Sigmoid, logit, binary probabilities and the total negative-log-likelihood l:

\sigma(x) = \frac{1}{1 + \exp(-x)} \;,\quad \sigma^{-1}(x) = \log \frac{x}{1 - x} \\ p^{+}_i = p^{+}_i(\beta, X) = \sigma\left(\sum_{j} X_{ij} \beta_j \right) \\ p_i = p_i(\beta, X, y) = \sigma\left(s_i \sum_{j} X_{ij} \beta_j \right) \\ l(\beta, X, y) = - \sum_i \left[ y_i \log p^{+}_i + (1 - y_i) \log(1 - p^{+}_i) \right] = - \sum_i \log p_i

We want to find \beta = \beta^{\infty} such that \frac{\partial l}{\partial \beta} \Big|_{\beta^{\infty}} \!\!= 0. We will recursively update a \beta^n vector starting from an initial guess \beta^0, producing a sequence (\beta^n) until convergence, defined as \|\beta^{n} - \beta^{n - 1}\|_\infty < \text{tol} (with some arbitrarily small \text{tol}). The update is done with the following rationale:

0 \approx \frac{\partial l}{\partial \beta_j} \bigg|_{\beta^{n+1}} = \frac{\partial l}{\partial \beta_j}\bigg|_{\beta^n} + \sum_{k=0}^{m-1} \frac{\partial^2 l}{\partial\beta_j \partial\beta_k} \bigg|_{\beta^n} \delta\beta^n_k + \ldots \,,\; \text{where}\; \delta\beta^n \equiv \beta^{n+1} - \beta^n \,, \\

so approximate \delta\beta^n by solving the linear system

A^n \cdot \delta\beta^n = B^n \,,\; \text{with} \\ A^n = \left[ A^n_{jk}\right] = \left[\frac{\partial^2 l}{\partial\beta_j \partial\beta_k} \right]\bigg|_{\beta^n} \\ B^n = \left[B^n_j\right] = \left[ - \frac{\partial l}{\partial \beta_j} \right] \bigg|_{\beta^n}

and update the coefficients via \beta^{n+1} = \beta^n + \delta\beta^n.

The derivatives of the loss are

-\frac{\partial l}{\partial \beta_j} = \sum_i s_i X_{ij} ( 1- p_i) = \sum_i X_{ij} (y_i - p^{+}_i) \\ \frac{\partial^2l}{\partial\beta_j \partial \beta_k} = \sum_i X_{ij} X_{ik} p_i (1 - p_i) = \sum_i X_{ij} X_{ik} p^{+}_i (1 - p^{+}_i)

And a reasonable ansatz for \beta^0 is

\beta^0 = \left[\sigma^{-1}\!\!\left(\frac{\sum_i y_i}{n}\right), 0, \ldots, 0 \right]

which has the advantage that it satisfies \frac{\partial l}{\partial \beta_0} \Big|_{\beta^0} \!\!= 0: with this choice p^{+}_i = \sigma(\beta^0_0) = \frac{\sum_i y_i}{n} for every i, so, in light of the first column of X being all ones, X_{i0} = 1, the sum \sum_i X_{i0} (y_i - p^{+}_i) vanishes.

The code by @johnc1231 has those derivatives and the ansatz explicitly typed in.

I’ll generate some toy data in python and try to infer the betas using np.linalg.solve, scipy.optimize.minimize and hl.nd.solve, to see if I get the same result in all three cases.

import numpy as np
rng = np.random.default_rng(seed=42)
import scipy
import hail as hl
hl.init()
print(np.__version__, scipy.__version__, hl.__version__, sep=" | ")

1.20.2 | 1.6.2 | 0.2.69-6d2bd28a8849

def sigmoid(arr):
    # Numerically stable sigmoid: use exp(-x) for positive x and exp(x) otherwise,
    # so that the argument of exp is never large and positive.
    result = np.empty(arr.shape, dtype='float')
    positive = arr > 0
    negative = ~positive
    result[positive] = 1 / (1 + np.exp(-arr[positive]))
    exp_arr = np.exp(arr[negative])
    result[negative] = exp_arr / (exp_arr + 1)
    return result

m = 5
n = 1234
_beta = np.array([-0.5, 0.1, 0.2, 1, 2])
X = np.concatenate([
    np.ones((n, 1)),
    rng.standard_normal((n, m-1))
], axis=1)

_Xbeta = X @ _beta
_p = sigmoid(_Xbeta)
y = rng.binomial(1, _p)

Here’s what this toy data looks like

Some definitions and the solution from numpy:

def logit(arr):
    return np.log(arr / (1 - arr))

def L(beta, X, y):
    # Total negative log-likelihood: sum_i -log(sigmoid(s_i * (X @ beta)_i)),
    # written so that exp never receives a large positive argument.
    s = 2 * y - 1
    sXbeta = s * (X @ beta)
    result = np.empty(sXbeta.shape, dtype='float')
    positive = sXbeta > 0
    result[positive] = np.log(1 + np.exp(-sXbeta[positive]))
    result[~positive] = -sXbeta[~positive] + np.log(1 + np.exp(sXbeta[~positive]))
    return result.sum()

def dLdbeta(beta, X, y):
    p = sigmoid(X @ beta)
    return X.T @ (p - y)

def d2Ldbeta2(beta, X, y=None):
    p = sigmoid(X @ beta)
    p1mp = p * (1 - p)
    XXp1mp = X[:, :, None] * X[:, None, :] * p1mp[:, None, None]
    return XXp1mp.sum(axis=0)

def get_beta_0(y, m=m):
    beta_0 = np.zeros(m, dtype='float')
    beta_0[0] = logit(y.mean())
    return beta_0

def delta_beta(beta, X, y):
    B = -dLdbeta(beta, X, y)
    A = d2Ldbeta2(beta, X)
    return np.linalg.solve(A, B)

def get_beta(beta, X, y, tol):
    delta = delta_beta(beta, X, y)
    beta = beta + delta
    if np.max(np.abs(delta)) < tol:
        return beta
    return get_beta(beta, X, y, tol)

beta_0 = get_beta_0(y)
beta_numpy = get_beta(beta_0, X, y, tol=1e-6)

Minimizing l with scipy:

sol = scipy.optimize.minimize(
    L,
    get_beta_0(y),
    jac=dLdbeta,
    hess=d2Ldbeta2,
    args=(X, y),
    tol=1e-6,
    method="Newton-CG"
)
beta_scipy = sol.x

And finally, doing the same thing as in numpy, but with hail.nd:

def hl_scalar_sigmoid(x):
    return hl.if_else(
        x > 0,
        1 / (1 + hl.exp(-x)),
        hl.rbind(
            hl.exp(x),
            lambda _: _ / (1 + _)
        )
    )

def hl_sigmoid(hl_nd):
    return hl_nd.map(hl_scalar_sigmoid)

def hl_dLdbeta(beta, X, y):
    p = hl_sigmoid(X @ beta)
    return X.T @ (p - y)

def hl_d2Ldbeta2(beta, X, y=None):
    p = hl_sigmoid(X @ beta)
    p1mp = p * (1 - p)
    XXp1mp = X[:, :, None] * X[:, None, :] * p1mp[:, None, None]
    return XXp1mp.sum(axis=0)

def hl_get_beta_0(hl_y, m=m):
    y_mean = hl.mean(
        hl_y._data_array()
    )
    beta_00 = hl.log(y_mean / (1 - y_mean))
    return hl.nd.hstack([
        hl.nd.array([beta_00]),
        hl.nd.zeros(
            m - 1,
            dtype=beta_00.dtype
        )
    ])

def hl_delta_beta(beta, X, y):
    B = -hl_dLdbeta(beta, X, y)
    A = hl_d2Ldbeta2(beta, X)
    return hl.nd.solve(A, B)

def _hl_get_beta(_, beta, X, y, tol):
    delta = hl_delta_beta(beta, X, y)
    beta = beta + delta
    converged = hl.max(
        hl.abs(delta)._data_array()
    ) < tol
    return hl.if_else(
        converged,
        beta,
        _(beta, X, y, tol)
    )

def hl_get_beta(beta, X, y, tol):
    return hl.experimental.loop(
        _hl_get_beta,
        hl.dtype('ndarray<float64, 1>'),
        beta, X, y, tol
    )

hl_X = hl.nd.array(X)
hl_y = hl.nd.array(y)
hl_beta_0 = hl_get_beta_0(hl_y)
beta_hail = hl_get_beta(hl_beta_0, hl_X, hl_y, 1e-6).collect()[0]

And it works!

assert(
    np.allclose(beta_numpy, beta_scipy) &\
    np.allclose(beta_numpy, beta_hail) &\
    np.allclose(beta_scipy, beta_hail)
)

So, other than stripping away the arguably more important part, which parallelizes the regression onto rows of a matrix-table, uses a prefit null model, and computes p-values, this is what I believe is also happening in @johnc1231’s code.
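
As a small follow-up to the p-value discussion above, I think the analytic Hessian from d2Ldbeta2 could be used directly for bevel-style Wald p-values on this toy example (my own sketch, not something Hail or @johnc1231’s code does):

from scipy import stats

# Covariance of the estimates ~ inverse Hessian of the negative log-likelihood
# evaluated at the fitted betas (no numerical differentiation needed).
hess = d2Ldbeta2(beta_numpy, X)
std = np.sqrt(np.diag(np.linalg.inv(hess)))
p_values = 2 * stats.norm.sf(np.abs(beta_numpy / std))
print(p_values.round(4))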

Again, let me know if I’ve said something false or dumb, or if I seem to be misunderstanding Hail’s interface! Thanks :slight_smile:
