Ordinal Regression

Hail should support ordinal regression. Ordinal regression deals with a dependent variable whose values are finite, discrete and ordered. This differs from a categorical variable where each value can be represented by a binary variable. As I understand it, ordinal regression of a dependent variable y is phrased in two steps: mapping from the independent variable x to a threshold-space, then mapping from threshold-space to the dependent variable y. For example:

\theta = x * w + \varepsilon \\ y = \begin{cases} y_1 & \theta < \theta_1 \\ y_2 & \theta_1 \leq \theta \lt \theta_2 \\ y_3 & \theta_2 \leq \theta \end{cases}

The choices of \theta_i and w are fit according to some loss function. I do not fully understand the power of the intermediate threshold space. At the very least, it involves projecting the x vector onto a line which is then partitioned.
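
A minimal numpy sketch of the prediction step described above (the weights and thresholds below are made up purely for illustration, and the noise term \varepsilon is omitted for simplicity):

import numpy as np

# Minimal sketch of the two-step picture above: project x onto a line with w,
# then bucket the result by the ordered thresholds. Weights and thresholds
# are made-up values for illustration only.
w = np.array([0.5, -1.0, 2.0])
thresholds = np.array([-1.0, 1.5])  # theta_1 < theta_2, so y takes one of 3 ordered values

def predict_label(x):
    theta = x @ w
    # side="right" matches the cases above: theta_1 <= theta < theta_2 maps to y_2
    return 1 + np.searchsorted(thresholds, theta, side="right")

print(predict_label(np.array([0.1, 0.2, 0.3])))  # -> 2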


Apologies for resurrecting the topic without being able to contribute much, but I'd just like to note my current personal interest in being able to use ordinal regression in Hail. Currently I'm using linear regression or logistic regression on data that should be analyzed with ordinal regression, only because I'm too lazy to leave the comfort of Hail.

Here's my understanding of possibly the simplest form of (linear) ordinal regression. I'll try to stick with notation and conventions from Wikipedia (including one-based indexing, and using \theta - wx rather than wx - \theta), and (without really losing much generality) I'll choose sigmoid as the link function. Perhaps it will resonate with someone and help keep the thread going. Please let me know if there's something wrong or dumb below :slight_smile: . Also, if anyone knows of a good textbook-like reference that derives some theoretical properties of such ordinal regression models, let me know as well!

Say an (x, y) pair is a single n-dimensional observation x together with its correct label y out of k possible ordinal labels.
The model is given by n coefficients w and k-1 ordered thresholds \theta_i.
More specifically, we have

x, w \in \mathbb{R}^n \\ y \in \{1, 2, \ldots, k\} \\ \theta_i \in \mathbb{R}, i=1,2,\ldots,k-1 \,, \\ \forall_{i=1}^{k-2} \theta_i < \theta_{i+1} \\ \theta_0 = -\infty\,,\; \theta_k = \infty

and w and \theta_i, i=1,2,\ldots,k-1 are to be optimized when fitting the model.
(The \epsilon in the original post is spurious, I think, if none of the thresholds is fixed.)

To that end one can consider the following ā€œcumulative probabilitiesā€:

P(y \le i| x) \equiv \sigma(\theta_i - wx) \equiv P_i(x)\,,\;i=0,1,\ldots,k \\ P_0 = 0 \,,\; P_k = 1

and their differences:

P(y=i|x) = P(y \le i | x) - P(y\le i-1|x) = P_i - P_{i-1} \equiv p_i(x)\,,\;i=1,\ldots,k.

The label y^\star(x) that the model predicts, given the observation x, could be defined by one of (at least) two distinct options, of which the first one (already described in the original post) seems more correct to me:
a) y^\star(x) is given by i such that \theta_{i-1} < wx \le \theta_i, which is also the smallest i such that P_i \ge \frac{1}{2}.
b) y^\star(x) is given by i such that p_i is the maximum of \{p_i\,,\;i=1,\ldots,k\}.
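
A small numeric sketch of the two rules, with made-up parameters and k=4 (the particular w, thresholds, and x are just illustrative values):

import numpy as np

# Numeric sketch of prediction rules (a) and (b) for k = 4.
# The weights, thresholds, and observation are made-up values.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, -0.5])
thresholds = np.array([-1.0, 0.5, 2.0])     # theta_1 < theta_2 < theta_3
x = np.array([0.3, 0.4])

P = sigmoid(thresholds - x @ w)                   # P_1, ..., P_{k-1}
p = np.diff(np.concatenate([[0.0], P, [1.0]]))    # p_1, ..., p_k

y_a = 1 + np.searchsorted(thresholds, x @ w)      # rule (a): theta_{i-1} < wx <= theta_i
y_b = 1 + np.argmax(p)                            # rule (b): most probable label
print(y_a, y_b)  # -> 2 2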

Here's an illustration for k=4:

Two fairly intuitive choices for the loss function, discussed by Rennie & Srebro, and implemented in the mord package, are (this is still talking about a single observation):

A) "IT", "Immediate-Threshold":

\log L^{IT}(x, y) = -\left[ \log(1 - P_{y-1}) + \log P_y \right] =\\ -\sum_{i=1}^{k-1} \log \sigma(s_i(y)(\theta_i - wx)) \,,\; \text{where}\; s_i(y) = \left\{\begin{array}{ll} 0, & i < y - 1 \\ -1, & i = y - 1 \\ 1, & i = y \\ 0, & i > y \end{array} \right. \;

B) "AT", "All-Threshold":

\log L^{AT}(x, y) = -\left[\sum_{i=1}^{y-1} \log (1 - P_i) + \sum_{i=y}^{k-1} \log P_i(x) \right] =\\ -\sum_{i=1}^{k-1} \log \sigma(s_i(y)(\theta_i - wx)) \,,\; \text{where}\; s_i(y) = \left\{\begin{array}{ll} -1, & i < y \\ 1, & i \ge y \end{array} \right.

In both cases the loss is a sum of more than one equally important negative-log-likelihood-losses (or binary crossentropy losses) associated with binary classifiers tied to particular \theta_i thresholds. In A only the two thresholds immediately around y are taken, whereas in B all thresholds are used. Both approaches reduce to binary logistic regression when k=2 (I hope).
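
To make the two losses concrete, here is a small numpy sketch that evaluates both for a single observation, directly from the formulas above (the parameter values passed in at the end are made up):

import numpy as np

# IT and AT losses for one observation, following the formulas above.
# w, thresholds, x, y are made-up values; k = len(thresholds) + 1.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def it_at_losses(x, y, w, thresholds):
    P = sigmoid(thresholds - x @ w)                 # P_1, ..., P_{k-1}
    P_full = np.concatenate([[0.0], P, [1.0]])      # prepend P_0 = 0, append P_k = 1
    # IT: only the two thresholds immediately around y contribute
    it = -(np.log(1.0 - P_full[y - 1]) + np.log(P_full[y]))
    # AT: every threshold contributes one binary cross-entropy term
    at = -(np.log(1.0 - P[:y - 1]).sum() + np.log(P[y - 1:]).sum())
    return it, at

print(it_at_losses(np.array([0.3, 0.4]), 2, np.array([1.0, -0.5]), np.array([-1.0, 0.5, 2.0])))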

To infer w and \theta's, one would of course need more than one (x, y) observation. The loss function for a collection of observations is simply the sum of losses for individual observations.

My limited personal experience is that the "AT" loss gives me better models than "IT" (in the sense that they achieve better metrics on a test set etc.). I've also heard anecdotes by other people to the same effect. So, my vote would be to focus on "AT" as the default option.

The implementation in mord adds L2 regularization penalizing w. In Hail it would probably be preferred not to regularize. On the other hand, there is no attempt at calculating p-values for w in mord. Another python implementation of a somewhat similar approach to the one in mord that I've stumbled on is: GitHub - Shopify/bevel: Ordinal regression in Python. Here, best I can tell, they optimize neither the "IT" nor the "AT" loss, but rather -\log p_y, i.e. the negative log-likelihood that is suggested by Wikipedia and (confusingly) by the blog post by mord's author. (Consequently, their model outputs labels y^\star from (b) rather than (a) described above.) But they do compute p-values for w. They use the diagonal elements of the inverted (numerically computed) hessian as variances for the weights. Given a variance std_j^2, the p-value for w_j comes from testing H_0\!: w_j = 0, treating w_j/std_j as standard normal. I'm not 100% sure that's justified, but it seems very similar to what's often done in the case of linear and logistic regression. Also, I'm not sure whether the hessian needs to be computed numerically (rather than by an analytical closed-form expression).
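
For reference, a sketch of that Wald-style p-value computation; beta_hat and hessian are placeholder names for a fitted coefficient vector and the hessian of the negative log-likelihood at the optimum (however it is obtained), not any Hail, mord, or bevel API:

import numpy as np
from scipy import stats

# Sketch of the bevel-style p-values described above; `beta_hat` and
# `hessian` are assumed inputs from some already-fitted model.
def wald_p_values(beta_hat, hessian):
    cov = np.linalg.inv(hessian)            # approximate covariance of the estimates
    std = np.sqrt(np.diag(cov))             # standard errors from the diagonal
    z = beta_hat / std                      # Wald statistics under H0: beta_j = 0
    return 2.0 * stats.norm.sf(np.abs(z))   # two-sided p-values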

What do you think? Is there perhaps a chance that existing Hail infrastructure for logistic regression could be used to implement such an ordinal model?

Hey Pawel,

Thanks for resurrecting. After briefly skimming over the math, I think this is probably possible to write with Hail's current ndarray infrastructure.

I very recently wrote a python version of hail's logistic regression method. It's in hail currently as _logistic_regression_rows_nd. It's slower than the original logistic_regression_rows method, and it doesn't implement all the modes that the original method does. My intention is to keep pushing on that until it can completely replace the original method, though I don't yet have a sense of the timeline for that. You can find the method here:

If you were so inclined, I think you could build ordinal regression by doing something similar. It's possible there's a method or two on ndarrays you'd need, but I'd be happy to add them if you ran into that.

I don't know what the performance characteristics would be (currently my ndarray version of logreg is slower than the original), but we are continuing to work on ndarray performance, so whatever it is, it should at least improve over time.

Also happy to answer any questions about the linked code.


Thanks! This definitely sounds constructive. I will try to have an honest look and come back with questions.

Hi!
I haven't yet made any progress with the ordinal regression, but I haven't forgotten about it.
Just as a small sanity check and an unambitious proof to myself that I understand something about the code that @johnc1231 has linked to, I'll try to summarize what I think is happening there with the binary logistic regression. I hope it's okay that I'll use this space as my scratchpad and the proverbial rubber duck. Hopefully, writing this down will make it easier - for me or anyone else - to generalize it to the ordinal regression.

So, basically one looks for the minimum of the negative-log-likelihood function by looking for the root of its first derivative. The latter is done with Newton's method, much as it is described in the documentation of hl.experimental.loop (Hail | Experimental), only in a multi-variable setting.

Some notation for data and the model:

i = 1,\ldots, n \quad \text{(observations)} \\ j = 0, 1, \ldots, m-1 \quad\text{(features, including the intercept)} \\ X = \left[ X_{ij} \right] \;,\quad X_{ij} \in \mathbb{R} \;,\\ \forall_i X_{i0} = 1 \quad \text{(the intercept)} \\ \beta = \left[ \beta_j \right] \\ y = \left[ y_i \right] \;,\quad y_i \in \{0, 1\} \;,\\ s = \left[ s_i \right] \;,\quad s_i = 2 y_i - 1 = \left\{\begin{array}{ll} -1, & y_i = 0 \\ 1, & y_i = 1 \end{array} \right. \\

Sigmoid, logit, binary probabilities and the total negative-log-likelihood l:

\sigma(x) = \frac{1}{1 + \exp(-x)} \;,\quad \sigma^{-1}(x) = \log \frac{x}{1 - x} \\ p^{+}_i = p^{+}_i(\beta, X) = \sigma\left(\sum_{j} X_{ij} \beta_j \right) \\ p_i = p_i(\beta, X, y) = \sigma\left(s_i \sum_{j} X_{ij} \beta_j \right) \\ l(\beta, X, y) = - \sum_i \left[ y_i \log p^{+}_i + (1 - y_i) \log(1 - p^{+}_i) \right] = - \sum_i \log p_i

We want to find \beta = \beta^{\infty} such that \frac{\partial l}{\partial \beta} \Big|_{\beta^{\infty}} \!\!= 0. We will recursively update a \beta^n vector starting from an initial guess \beta^0, producing a sequence (\beta^n) until convergence, defined as |\beta^{n} - \beta^{n - 1}|_1 < \text{tol} (with some arbitrary small \text{tol}). The updating is done with the following rationale.

0 \approx \frac{\partial l}{\partial \beta_j} \bigg|_{\beta^{n+1}} = \frac{\partial l}{\partial \beta_j}\bigg|_{\beta^n} + \sum_{k=0}^{m-1} \frac{\partial^2 l}{\partial\beta_j \partial\beta_k} \bigg|_{\beta^n} \delta\beta^n_k + \ldots \\

so approximate \delta\beta^n by solving the linear system

A^n \cdot \delta\beta^n = B^n \,,\; \text{with} \\ A^n = \left[ A^n_{jk}\right] = \left[\frac{\partial^2 l}{\partial\beta_j \partial\beta_k} \right]\bigg|_{\beta^n} \\ B^n = \left[B^n_j\right] = \left[ - \frac{\partial l}{\partial \beta_j} \right] \bigg|_{\beta^n}

and update the coefficients via \beta^{n+1} = \beta^n + \delta\beta^n.

The derivatives of the loss are

-\frac{\partial l}{\partial \beta_j} = \sum_i s_i X_{ij} ( 1- p_i) = \sum_i X_{ij} (y_i - p^{+}_i) \\ \frac{\partial^2l}{\partial\beta_j \partial \beta_k} = \sum_i X_{ij} X_{ik} p_i (1 - p_i) = \sum_i X_{ij} X_{ik} p^{+}_i (1 - p^{+}_i)

And a reasonable ansatz for \beta^0 is

\beta^0 = \left[\sigma^{-1}\!\!\left(\frac{\sum_i y_i}{n}\right), 0, \ldots, 0 \right]

which has the advantage that it satisfies \frac{\partial l}{\partial \beta_0} \Big|_{\beta^0} \!\!= 0: with only the intercept nonzero, p^{+}_i = \sigma(\beta^0_0) = \tfrac{\sum_i y_i}{n} for every i, so, in light of the first column of X being all ones (X_{i0} = 1), \sum_i X_{i0}(y_i - p^{+}_i) = 0.

The code by @johnc1231 has those derivatives and the ansatz explicitly typed in.

I'll generate some toy data in python and try to infer the betas using np.linalg.solve, scipy.optimize.minimize, sklearn.linear_model.LogisticRegression, and hl.nd.solve, to see if I get the same result in the four cases.

import numpy as np
rng = np.random.default_rng(seed=42)
import scipy
import sklearn 
import hail as hl
hl.init()
print(np.__version__, scipy.__version__, sklearn.__version__, hl.__version__, sep=" | ")

1.20.2 | 1.6.2 | 0.24.2 | 0.2.69-6d2bd28a8849

def sigmoid(arr):
    result = np.empty(arr.shape, dtype='float')
    positive = arr > 0
    negative = ~positive
    result[positive] = 1 / (1 + np.exp(-arr[positive]))
    exp_arr = np.exp(arr[negative])
    result[negative] = exp_arr / (exp_arr + 1)
    return result

m = 5
n = 1234
_beta = np.array([-0.5, 0.1, 0.2, 1, 2])
X = np.concatenate([
    np.ones((n, 1)),
    rng.standard_normal((n, m-1))
], axis=1)

_Xbeta = X @ _beta
_p = sigmoid(_Xbeta)
y = rng.binomial(1, _p)

Here's what this toy data looks like

Some definitions and the solution from numpy:

def logit(arr):
    return np.log(arr / (1 - arr))

def L(beta, X, y):
    s = 2 * y - 1
    sXbeta = s * (X @ beta)
    result = np.empty(sXbeta.shape, dtype='float')
    positive = sXbeta > 0
    result[positive] = np.log(1 + np.exp(-sXbeta[positive]))
    result[~positive] = -sXbeta[~positive] + np.log(1 + np.exp(sXbeta[~positive]))
    return result.sum()

def dLdbeta(beta, X, y):
    p = sigmoid(X @ beta)
    return X.T @ (p - y)

def d2Ldbeta2(beta, X, y=None):
    p = sigmoid(X @ beta)
    p1mp = p * (1 - p)
    XXp1mp = X[:, :, None] * X[:, None, :] * p1mp[:, None, None]
    return XXp1mp.sum(axis=0)

def get_beta_0(y, m=m):
    beta_0 = np.zeros(m, dtype='float')
    beta_0[0] = logit(y.mean())
    return beta_0

def delta_beta(beta, X, y):
    B = -dLdbeta(beta, X, y)
    A = d2Ldbeta2(beta, X)
    return np.linalg.solve(A, B)

def get_beta(beta, X, y, tol):
    delta = delta_beta(beta, X, y)
    beta = beta + delta
    if np.max(np.abs(delta)) < tol:
        return beta
    return get_beta(beta, X, y, tol)

beta_0 = get_beta_0(y)
beta_numpy = get_beta(beta_0, X, y, tol=1e-6)

Minimizing l with scipy:

sol = scipy.optimize.minimize(
    L,
    get_beta_0(y),
    jac=dLdbeta,
    hess=d2Ldbeta2,
    args=(X, y),
    tol=1e-6,
    method="Newton-CG"
)
beta_scipy = sol.x

Fitting logistic regression in scikit-learn:

sk_logreg = sklearn.linear_model.LogisticRegression(
    penalty='none',
    fit_intercept=False,
    tol=1e-6,
    solver='newton-cg'
)
sk_logreg.fit(X, y)
beta_sklearn = sk_logreg.coef_

And finally, doing the same thing as in numpy, but with hail.nd:

def hl_scalar_sigmoid(x):
    return hl.if_else(
        x > 0,
        1 / (1 + hl.exp(-x)),
        hl.rbind(
            hl.exp(x),
            lambda _: _ / (1 + _)
        )
    )

def hl_sigmoid(hl_nd):
    return hl_nd.map(hl_scalar_sigmoid)

def hl_dLdbeta(beta, X, y):
    p = hl_sigmoid(X @ beta)
    return X.T @ (p - y)

def hl_d2Ldbeta2(beta, X, y=None):
    p = hl_sigmoid(X @ beta)
    p1mp = p * (1 - p)
    XXp1mp = X[:, :, None] * X[:, None, :] * p1mp[:, None, None]
    return XXp1mp.sum(axis=0)

def hl_get_beta_0(hl_y, m=m):
    y_mean = hl.mean(
        hl_y._data_array()
    )
    beta_00 = hl.log(y_mean / (1 - y_mean))
    return hl.nd.hstack([
        hl.nd.array([beta_00]),
        hl.nd.zeros(
            m - 1,
            dtype=beta_00.dtype
        )
    ])

def hl_delta_beta(beta, X, y):
    B = -hl_dLdbeta(beta, X, y)
    A = hl_d2Ldbeta2(beta, X)
    return hl.nd.solve(A, B)

def _hl_get_beta(_, beta, X, y, tol):
    delta = hl_delta_beta(beta, X, y)
    beta = beta + delta
    converged = hl.max(
        hl.abs(delta)._data_array()
    ) < tol
    return hl.if_else(
        converged,
        beta,
        _(beta, X, y, tol)
    )

def hl_get_beta(beta, X, y, tol):
    return hl.experimental.loop(
        _hl_get_beta,
        hl.dtype('ndarray<float64, 1>'),
        beta, X, y, tol
    )

hl_X = hl.nd.array(X)
hl_y = hl.nd.array(y)
hl_beta_0 = hl_get_beta_0(hl_y)
beta_hail = hl_get_beta(hl_beta_0, hl_X, hl_y, 1e-6).collect()[0]

And it works!

assert(
    np.allclose(beta_numpy, beta_scipy) &\
    np.allclose(beta_numpy, beta_hail) &\
    np.allclose(beta_numpy, beta_sklearn) &\
    np.allclose(beta_scipy, beta_hail) &\
    np.allclose(beta_scipy, beta_sklearn) &\
    np.allclose(beta_hail, beta_sklearn)
)

So, other than stripping away the (arguably more important) part that parallelizes the regression across rows of a matrix-table, uses a prefit null model, and computes p-values, this is what I believe is also happening in @johnc1231's code.

Again, let me know if I've said something false or dumb, or if I seem to be misunderstanding Hail's interface! Thanks :slight_smile:


Ok, so onto the ordinal case. It's actually not too difficult to write down the derivatives of the log likelihood. Here's one notation (apologies for introducing yet another slightly different bunch of indices):

i = 1,\ldots, n \quad \text{(samples)} \\ j = 1, \ldots, m \quad \text{(features)} \\ a = 1, \ldots, k-1 \quad \text{(thresholds)} \\ k>2 \\ X = \left[X_{ij}\right] \;,\quad X_{ij} \in \mathbb{R} \\ y = \left[y_i\right] \;,\quad y_i \in \{ 1,\ldots, k \} \\ \beta = \left[\beta_j \right] \;,\quad \beta_j \in \mathbb{R} \\ \theta = \left[ \theta_a \right] \;,\quad \theta_a \in \mathbb{R} \;,\quad \forall_{a = 1,\ldots,k-2}\; (\theta_a \!<\! \theta_{a + 1}) \\ \eta = \left[\eta_a \right] \\ \eta_1 = \theta_1 \;,\quad \forall_{a=2,\ldots,k-1} \; (\eta_a \!=\! \theta_a \!-\! \theta_{a-1} )\\ \theta_a = \sum_{b=1}^{a} \eta_b \;,\quad \forall_{a=2, \ldots,k-1} \;(\eta_a \!>\! 0) \\ s = \left[ s_{ia}\right] \;,\quad s_{ia} = \left\{\begin{array}{ll} \;1, & y_i \le a \\ -1, & y_i > a \end{array} \right. \\ t = \left[ t_{ab}\right] \;,\quad t_{ab} = \left\{\begin{array}{ll} 1, & a \ge b \\ 0, & a < b \end{array} \right. \; \text{(lower triangle)} \\ \chi_{ia} = s_{ia}\left( \theta_a - \sum_{j}X_{ij}\beta_j \right) = s_{ia} \left( \sum_b t_{ab} \eta_b - \sum_{j}X_{ij}\beta_j \right) \\ p_{ia} = \sigma\left(\chi_{ia}\right) \\ l = l(\eta, \beta, X, y) = -\sum_{i,a} \log p_{ia}

The above l is supposed to be the AT loss from before. So now some intermediate derivatives are

\sigma'(x) = \sigma(x)\left[1 - \sigma(x)\right] \\ \frac{\partial \chi_{ia}}{\partial \eta_b} = s_{ia} t_{ab} \\ \frac{\partial \chi_{ia}}{\partial \beta_j} = -s_{ia} X_{ij} \\

And finally

\frac{\partial l}{\partial \eta_b} = -\sum_{i,a} \left(1 - p_{ia}\right) s_{ia} t_{ab} \\ \frac{\partial l}{\partial \beta_j} = \sum_{i,a} \left(1 - p_{ia}\right) s_{ia} X_{ij} \\ \frac{\partial^2 l}{\partial \eta_b \partial \eta_c} = \sum_{i,a} p_{ia} \left(1 - p_{ia}\right) t_{ab}t_{ac} \\ \frac{\partial^2 l}{\partial \eta_b \partial \beta_j} =- \sum_{i,a} p_{ia} \left(1 - p_{ia}\right) t_{ab}X_{ij} \\ \frac{\partial^2 l}{\partial \beta_j \partial \beta_k} = \sum_{i,a} p_{ia} \left(1 - p_{ia}\right) X_{ij}X_{ik}

One last thing is the initial guess for the parameters. Drawing inspiration from the binary case, let it be:

\theta^0_a = \sigma^{-1} \left(\frac{\sum_i t_{a y_i}}{n}\right)\\ \beta^0_j = 0_j

I did, as before, generate some toy data from an arbitrary choice of the \eta,\,\beta parameters, and tried to infer them back by minimizing l, including in numpy with the simple Newton's method described previously. The punchline is that I got the same result from the handwritten Newton as from scipy.optimize.minimize and from mord, which hopefully suggests that the same could be done in hail.

Here are my adventures:

import numpy as np
rng = np.random.default_rng(seed=42)
import scipy
import mord
print(np.__version__, scipy.__version__, mord.__version__, sep=" | ")
# 1.20.2 | 1.6.2 | 0.6
def sigmoid(arr):
    result = np.empty(arr.shape, dtype='float')
    pos = arr > 0
    neg = ~pos
    result[pos] = 1 / (1 + np.exp(-arr[pos]))
    exp_arr = np.exp(arr[neg])
    result[neg] = exp_arr / (exp_arr + 1)
    return result

k = 5
m = 4
n = 1234
_beta = np.array([0.1, 0.2, 1, 2])
_theta = np.array([10, 16, 17, 20])
X = rng.uniform(0, 8, (n, m))

_Xbeta = X @ _beta
_p = sigmoid(_theta[None, :] - _Xbeta[:, None] + 2 * rng.standard_normal(n)[:, None])
y = 1 + (_p < 0.5).sum(axis=1)

Some plots of the X, y

Translating the LaTeX to numpy:

def theta(eta):
    return np.cumsum(eta)

def s(y, k=k):
    s = 2 * (y[:, None] <= np.arange(1, k)[None, :]) - 1
    return s.astype('float')

def t(k=k):
    # lower triagle, np.tril
    t = np.arange(1, k)[:, None] >= np.arange(1, k)[None, :]
    return t.astype('float')

def chi(eta, beta, X, y):
    return s(y) * (
        theta(eta)[None, :] - (X @ beta)[:, None]
    )

def p(eta, beta, X, y):
    return sigmoid(chi(eta, beta, X, y))

def L(eta_beta, X, y, k=k):
    eta = eta_beta[:k-1]
    beta = eta_beta[k-1:]
    return -np.log(p(eta, beta, X, y)).sum()

def dLdeta(eta, beta, X, y):
    _p = p(eta, beta, X, y)
    _1 = ((1 - _p) * s(y)).sum(axis=0)
    _2 = (_1[:, None] * t()).sum(axis=0)
    return -_2

def dLdbeta(eta, beta, X, y):
    _p = p(eta, beta, X, y)
    _1 = ((1 - _p) * s(y)).sum(axis=1)
    _2 = (_1[:, None] * X).sum(axis=0)
    return _2

def dLdeta_beta(eta_beta, X, y, k=k):
    eta = eta_beta[:k-1]
    beta = eta_beta[k-1:]
    return np.concatenate([
        dLdeta(eta, beta, X, y),
        dLdbeta(eta, beta, X, y)
    ])

def d2Ldeta2(eta, beta, X, y):
    _p = p(eta, beta, X, y)
    _1 = (_p * (1 - _p)).sum(axis=0)
    _2 = t()[:, :, None] * t()[:, None, :] 
    _3 = (_1[:, None, None] * _2).sum(axis=0)
    return _3

def d2Ldetadbeta(eta, beta, X, y):
    _p = p(eta, beta, X, y)
    _1 = _p * (1 - _p)
    _2 = (_1[:, :, None] * t()[None, :, :]).sum(axis=1)
    _3 = (_2[:, :, None] * X[:, None, :]).sum(axis=0)
    return  -_3

def d2Ldbeta2(eta, beta, X, y):
    _p = p(eta, beta, X, y)
    _1 = (_p * (1 - _p)).sum(axis=1)
    _2 = _1[:, None, None] * X[:, :, None] * X[:, None, :]
    return _2.sum(axis=0)

def d2Ldeta_beta2(eta_beta, X, y, k=k):
    eta = eta_beta[0:k-1]
    beta = eta_beta[k-1:]
    _d2Ldeta2 = d2Ldeta2(eta, beta, X, y)
    _d2Ldetadbeta = d2Ldetadbeta(eta, beta, X, y)
    _d2Ldbeta2 = d2Ldbeta2(eta, beta, X, y)
    
    return np.concatenate([
        np.concatenate([
            _d2Ldeta2, _d2Ldetadbeta
        ], axis=1),
        np.concatenate([
            _d2Ldetadbeta.T, _d2Ldbeta2
        ], axis=1)
    ], axis=0)

def logit(arr):
    return np.log(arr / (1 - arr))

def get_eta_beta_0(y, k=k, m=m):
    _t = (np.arange(1, k)[:, None] >= y[None, :]).astype('float')
    _theta = logit(_t.mean(axis=1))
    _eta = np.diff(_theta)
    return np.concatenate([
        np.array([_theta[0]]),
        _eta,
        np.zeros(m)
    ])

Solution from np.linalg.solve:

def get_delta_eta_beta(eta_beta, X, y, k=k):
    B = -dLdeta_beta(eta_beta, X, y)
    A = d2Ldeta_beta2(eta_beta, X, y)
    return np.linalg.solve(A, B)

def get_eta_beta(eta_beta, X, y, tol, k=k):
    delta = get_delta_eta_beta(eta_beta, X, y, k=k)
    eta_beta = eta_beta + delta
    eta_idx = slice(1, k-1)
    eta_beta[eta_idx] = np.maximum(eta_beta[eta_idx], 1e-5)
    if np.max(np.abs(delta)) < tol:
        return eta_beta
    return get_eta_beta(eta_beta, X, y, tol)

eta_beta_0 = get_eta_beta_0(y)
eta_beta_numpy = get_eta_beta(eta_beta_0, X, y, 1e-6)
eta_numpy = eta_beta_numpy[0:k-1]
theta_numpy = theta(eta_numpy)
beta_numpy = eta_beta_numpy[k-1:]

(Notice the naive enforcement of the \eta_{2,\ldots} > 0 constraints via np.maximum in get_eta_beta.)

Solution from scipy.optimize.minimize:

sol = scipy.optimize.minimize(
    L,
    get_eta_beta_0(y),
    jac=dLdeta_beta,
    hess=d2Ldeta_beta2,
    args=(X, y),
    bounds=(
        [(None, None)]
        + [(1e-5, None)] * (k - 2)
        + [(None, None)] * m
    ),
    tol=1e-6,
    method="trust-exact"
)
eta_beta_scipy = sol.x
eta_scipy = eta_beta_scipy[0:k-1]
theta_scipy = theta(eta_scipy)
beta_scipy = eta_beta_scipy[k-1:]

(Don't ask me what the trust-exact method is. It's supposed to be described in Nocedal, Wright, Numerical Optimization, 2nd ed.)

And a solution from mord:

mordAT = mord.LogisticAT(alpha=0).fit(X, y)
theta_mord = mordAT.theta_
beta_mord = mordAT.coef_

The three solutions agree

assert(
    np.allclose(theta_numpy, theta_scipy) & np.allclose(beta_numpy, beta_scipy) &\
    np.allclose(theta_numpy, theta_mord) & np.allclose(beta_numpy, beta_mord) &\
    np.allclose(theta_scipy, theta_mord) & np.allclose(beta_scipy, beta_mord)
)

In case anyone's curious, the found thresholds aren't great, but the betas seem reasonably close:

print(*_theta, sep="\t")
print(*theta_numpy.round(2), sep="\t")
# 10     16     17     20
# 8.7    14.26  14.97  17.27
print(*_beta, sep="\t")
print(*beta_numpy.round(3), sep="\t")
# 0.1    0.2    1.0    2.0
# 0.046  0.175  0.884  1.767

A couple of things:

  1. The scipy and mord methods call an underlying optimizer to take care of the \eta_{2,\ldots} > 0 constraints. I don't know how reliable the simple enforcement via elementwise maximum is. In the above example it doesn't make any difference to completely comment it out in get_eta_beta, so I suppose I got lucky and this time the trajectory did not cross into \eta_{2,\ldots} \le 0. I've also done some quick experiments with reparametrizing \eta_{2, \ldots} = \exp(\zeta_{2, \ldots}), but this seemed to make the convergence much harder.

  2. I've tried to naively rewrite the above numpy approach in hail, but I'm getting a lot of Java Heap Out Of Memory errors even with tiny matrices, so I must be doing something terribly wrong. I'll need a minute to experiment some more and come back with complaints :slight_smile: If someone else can do the rewrite properly, please don't hesitate to share.

  3. What seems particularly bad in my approach above is the constant packing and unpacking of the i and a indexed arrays between each update.

  4. To respond to the original kind remark by @johnc1231 about hl.nd functionalities that might be missing: When trying to rewrite the above approach from numpy to hail I've found myself implementing silly versions of elementwise >=, np.cumsum, np.diff, np.mean, and np.maximum:

def hl_gte(nd_left, nd_right):
    # >=
    return hl.map(
        lambda _: hl.float(_>=0),
        (nd_left - nd_right)
    )

def hl_cumsum(nd):
    nd = nd.reshape(-1)
    return hl.nd.array(
        hl.cumulative_sum(
            nd._data_array()
        )
    )

def hl_diff(nd):
    nd = nd.reshape(-1)
    return nd[1:] - nd[:-1]

def hl_mean(nd, dim):
    return nd.sum(axis=dim) / nd.shape[dim]

def hl_maximum(nd_left, nd_right):
    return hl.nd.array(hl.zip(
        nd_left._data_array(),
        nd_right._data_array()
    ).map(
        lambda _: hl.max(*_)
    ))

Any suggestions on how to properly implement (or circumvent) those functionalities will be welcome.

  5. I've also found myself doing a lot of broadcasting, e.g. array_1[:, :, None] * array_2[:, None, :]. If there are some nuances related to that, e.g. about differences between using None-indexing vs reshaping, I'd also love to hear about it.

Many thanks! :slight_smile:

1 Like

This is all super cool! You're definitely suffering from having to go back and forth between arrays and ndarrays all the time, and we need to make this easier.

As a first pass, it should be easy for me to add implementations of your hl_maximum and hl_gte methods. They're quick because they do a computation that is the same shape as adding two ndarrays, so they should be expressible with our current infrastructure. cum_sum is the trickier one, but we could maybe implement it easily as a multiply by a triangular matrix of 1s or something.
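
For example, the cumulative sum of a 1-D float64 ndarray could maybe look something like this, staying in hl.nd and reusing the pieces already in this thread (untested sketch, assuming the length n is available as a Python int):

import hail as hl

# Sketch only: cumulative sum of a 1-D float64 ndarray `nd` of length `n`
# as a matmul with a lower-triangular matrix of ones, avoiding the round
# trip through hl.cumulative_sum on a plain array.
def hl_cumsum_matmul(nd, n):
    i = hl.nd.arange(n)
    # ones[a, b] = 1.0 if a >= b else 0.0  (np.tril of ones)
    ones = (i.reshape(-1, 1) - i.reshape(1, -1)).map(
        lambda d: hl.float(d >= 0)
    )
    return ones @ nd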


I'll take some time to read through this more carefully in a little bit. But to address some of your specific points:

  1. I'm surprised you're getting out of memory errors on tiny matrices. How tiny are we talking? If you give me an example I can run, I can investigate.

  2. As far as I know, in numpy, using None (or the slightly more readable np.newaxis) when indexing is the same as reshaping. In hail, the way we implemented None indexing is to break it apart into successive calls to the normal slicing operation and then a reshape, so for hail purposes it should be identical.
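
A quick numpy sanity check of that equivalence:

import numpy as np

# None-indexing and reshaping produce the same arrays in numpy;
# per the above, hail lowers None-indexing to slicing + reshape as well.
a = np.arange(6.0)
assert np.array_equal(a[:, None], a.reshape(-1, 1))
assert np.array_equal(a[None, :] * a[:, None], np.outer(a, a))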

Hi, @johnc1231!
Thanks so much for bearing with me :slight_smile:
Here are my adventures with trying to type this ordinal regression into Hail. I've tried two slightly different approaches: one strictly analogous to the one described above, and a second one where I'm packing the \eta and \beta parameter vectors into one, \gamma = [\eta\ldots, \beta\ldots].

More concretely, the second approach (using s and t defined before):

i = 1,\ldots,n \\ j = 1,\ldots,m \\ a = 1,\ldots,k-1 \\ \dot{a} = 1,\ldots,k-1+m \\ \gamma = [\gamma_{\dot{a}}] = [\eta_1, \eta_2, \ldots, \eta_{k-1}, \beta_1, \beta_2, \ldots, \beta_m] \\ \dot{X} = [\dot{X}_{ia\dot{a}}] \;,\quad \dot{X}_{ia\dot{a}} = \left\{\begin{array}{ll} t_{a\dot{a}}, & \dot{a} \le k-1 \\ -X_{i,\dot{a}-k+1}, & \dot{a} > k-1 \end{array} \right. \\ \chi_{ia} = s_{ia} \sum_{\dot{a}} \dot{X}_{ia\dot{a}} \gamma_{\dot{a}} \;,\quad p_{ia} = \sigma(\chi_{ia}) \\ l = -\sum_{ia} \log p_{ia} \\ -\frac{\partial l}{\partial \gamma_{\dot{a}}} = \sum_{ia}(1 - p_{ia}) s_{ia} \dot{X}_{ia\dot{a}} \\ \frac{\partial^2 l}{\partial\gamma_{\dot{a}} \partial \gamma_{\dot{b}}} = \sum_{ia} p_{ia} (1 - p_{ia}) \dot{X}_{ia\dot{a}}\dot{X}_{ia\dot{b}}

I'm setting up some dummy data, same as before, but choosing to make fewer samples (123) and fewer labels (3):

import numpy as np  # __version__ 1.20.2
rng = np.random.default_rng(seed=42)
import scipy  # __version__ 1.6.2
import hail as hl  # __version__ 0.2.69-6d2bd28a8849
import mord  # __version__ 0.6
hl.init()
def sigmoid(arr):
    result = np.empty(arr.shape, dtype='float')
    positive = arr > 0
    negative = ~positive
    result[positive] = 1 / (1 + np.exp(-arr[positive]))
    exp_arr = np.exp(arr[negative])
    result[negative] = exp_arr / (exp_arr + 1)
    return result

k = 3
m = 4
n = 123
_beta = np.array([0.1, 0.2, 1, 2])
_theta = np.array([10, 15])
X = rng.uniform(0, 8, (n, m))

_Xbeta = X @ _beta
_p = sigmoid(_theta[None, :] - _Xbeta[:, None] + 2 * rng.standard_normal(n)[:, None])
y = 1 + (_p < 0.5).sum(axis=1)

mord fits without complaints

mordAT = mord.LogisticAT(alpha=0).fit(X, y)
theta_mord = mordAT.theta_  # [ 9.12872647, 13.26146952]
beta_mord = mordAT.coef_  # [-0.05488618,  0.26912198,  1.06027319,  1.6837583 ]

The beginning of using Hail. First, the definitions that are common to the two approaches:

hl_y = hl.nd.array(y)
hl_X = hl.nd.array(X)

def hl_gte(nd_left, nd_right):
    # >=
    return hl.map(
        lambda _: hl.float(_>=0),
        (nd_left - nd_right)
    )

def hl_scalar_sigmoid(x):
    return hl.if_else(
        x > 0,
        1 / (1 + hl.exp(-x)),
        hl.rbind(
            hl.exp(x),
            lambda _: _ / (1 + _)
        )
    )

def hl_sigmoid(hl_nd):
    return hl_nd.map(hl_scalar_sigmoid)

def hl_log(nd):
    return nd.map(lambda _: hl.log(_))

def hl_logit(nd):
    return hl_log(nd / (1 - nd))

def hl_mean(nd, dim):
    return nd.sum(axis=dim) / nd.shape[dim]

def hl_diff(nd):
    return nd[1:] - nd[:-1]

def hl_get_t(k):
    # lower triangle, np.tril
    i = hl.nd.arange(1, k)
    return hl_gte(
        i.reshape(-1, 1),
        i.reshape(1, -1)
    )

def hl_get_s(y, k):
    return 2 * hl_gte(
        hl.nd.arange(1, k).reshape(1, -1),
        y.reshape(-1, 1)
    ) - 1

def hl_maximum(nd_left, nd_right):
    return hl.nd.array(hl.zip(
        nd_left._data_array(),
        nd_right._data_array()
    ).map(
        lambda _: hl.max(*_)
    ))

def hl_get_gamma_0(y, k, m):
    t = hl_gte(
        hl.nd.arange(1, k)[:, None],
        y[None, :]
    )
    theta = hl_logit(
        hl_mean(t, 1)
    )
    eta = hl_diff(theta)
    return hl.nd.hstack([
        theta[:1],
        eta,
        hl.nd.zeros(m)
    ])

hl_get_eta_beta_0 = hl_get_gamma_0

def hl_get_gamma_lower_bound(k, m):
    return hl.nd.hstack([
        hl.nd.array([-np.inf]),
        hl.nd.array([0] * (k-2), dtype=hl.dtype('float')),
        hl.nd.array([-np.inf] * m)
    ])

hl_get_eta_beta_lower_bound = hl_get_gamma_lower_bound

The first approach:

def hl_get_hess_negjac(X, t, s, p):
    
    _1 = ((1 - p) * s)[:, :, None]
    negdLdeta = (_1 * t[None, :, :]).sum(axis=(0, 1))
    negdLdbeta = -(_1 * X[:, None, :]).sum(axis=(0, 1))
    negjac = hl.nd.hstack([negdLdeta, negdLdbeta])
    
    _2 = p * (1 - p)
    d2Ldeta2 = (_2.sum(axis=0) * t[:, :, None] * t[:, None, :]).sum(axis=0)
    _3 = (_2[:, :, None] * t[None, :, :]).sum(axis=1)[:, :, None]
    d2Ldetadbeta = (_3 * X[:, None, :]).sum(axis=0)
    _4 = _2.sum(axis=1)[:, None, None] * X[:, :, None] * X[:, None, :]
    d2Ldbeta2 = _4.sum(axis=0)
    hess = hl.nd.vstack([
        hl.nd.hstack([
            d2Ldeta2, d2Ldetadbeta
        ]),
        hl.nd.hstack([
            d2Ldetadbeta.T, d2Ldbeta2
        ])
    ])
    return hess, negjac


def _hl_get_eta_beta(_, eta_beta, eta_beta_lower_bound, X, t, s, tol, k):
    eta = eta_beta[:k-1]
    beta = eta_beta[k-1:]
    p = hl_sigmoid(s * (
        (t @ eta)[None, :] - (X @ beta)[:, None]
    ))
    delta_eta_beta = hl.nd.solve(
        *hl_get_hess_negjac(X, t, s, p)
    )
    eta_beta = eta_beta + delta_eta_beta
    eta_beta = hl_maximum(eta_beta, eta_beta_lower_bound)
    converged = hl.max(
        hl.abs(delta_eta_beta)._data_array()
    ) < tol
    return hl.if_else(
        converged,
        eta_beta,
        _(eta_beta, eta_beta_lower_bound, X, t, s, tol, k)
    )

def hl_get_eta_beta(X, y, tol, k, m):
    eta_beta_0 = hl_get_eta_beta_0(y, k, m)
    eta_beta_lower_bound = hl_get_eta_beta_lower_bound(k, m)
    t = hl_get_t(k)
    s = hl_get_s(y, k)
    return hl.experimental.loop(
        _hl_get_eta_beta,
        hl.dtype('ndarray<float64, 1>'),
        eta_beta_0, eta_beta_lower_bound, X, t, s, tol, k
    )
hl_eta_beta = hl_get_eta_beta(hl_X, hl_y, 1e-6, k, m)
hl_eta_beta.take(1)[0] # Error summary: OutOfMemoryError: Java heap space

So that's the OutOfMemoryError I was complaining about :slight_smile:

And for the second approach:

def hl_get_Xdot(X, k):
    # concat([t[None, :, :], -Xdot[:, None, :]], axis=2)
    t = hl_get_t(k)
    return hl.nd.concatenate([
        hl.nd.concatenate(
            hl.range(hl.int(X.shape[0])).map(
                lambda _: t[None, :, :]
            ), axis=0
        ),
        -hl.nd.concatenate(
            hl.range(hl.int(t.shape[0])).map(
                lambda _: X[:, None, :]
            ), axis=1
        )
    ], axis=2)


def hl_get_hess_negjac(Xdot, s, p):
    _1 = ((1 - p) * s)[:, :, None] * Xdot
    negjac = _1.sum(axis=(0, 1))
    _2 = (p * (1 - p))[:, :, None, None] * Xdot[:, :, :, None] * Xdot[:, :, None, :]
    hess = _2.sum(axis=(0, 1))
    return hess, negjac


def _hl_get_gamma(_, gamma, gamma_lower_bound, Xdot, s, tol):
    p = hl_sigmoid(
        s * (Xdot @ gamma)
    )
    delta_gamma = hl.nd.solve(
        *hl_get_hess_negjac(Xdot, s, p)
    )
    gamma = gamma + delta_gamma
    gamma = hl_maximum(gamma, gamma_lower_bound)
    converged = hl.max(
        hl.abs(delta_gamma)._data_array()
    ) < tol
    return hl.if_else(
        converged,
        gamma,
        _(gamma, gamma_lower_bound, Xdot, s, tol)
    )

def hl_get_gamma(X, y, tol, k, m):
    gamma_0 = hl_get_gamma_0(y, k, m)
    gamma_lower_bound = hl_get_gamma_lower_bound(k, m)
    Xdot = hl_get_Xdot(X, k)
    s = hl_get_s(y, k)
    return hl.experimental.loop(
        _hl_get_gamma,
        hl.dtype('ndarray<float64, 1>'),
        gamma_0, gamma_lower_bound, Xdot, s, tol
    )
hl_gamma = hl_get_gamma(hl_X, hl_y, 1e-5, k, m)
hl_gamma.take(1)[0]  # Error summary: HailException: Incompatible NDArrayshapes: [ 0 0 ] vs [ 123 2 ]

Here I'm getting an error about incompatible array shapes, so it sounds like some silly mistake in my definitions, but when I try to 'loop' step by step, without using hl.experimental.loop, I'm stumbling on another error, a ClassTooLargeException:

hl_gamma_0 = hl_get_gamma_0(hl_y, k, m)
hl_gamma_lower_bound = hl_get_gamma_lower_bound(k, m)
hl_Xdot = hl_get_Xdot(hl_X, k)
hl_s = hl_get_s(hl_y, k)
hl_p_0 = hl_sigmoid(
    hl_s * (hl_Xdot @ hl_gamma_0)
)
hl_delta_gamma_0 = hl.nd.solve(
    *hl_get_hess_negjac(hl_Xdot, hl_s, hl_p_0)
)
hl_delta_gamma_0.take(1)[0]  # Error summary: ClassTooLargeException: Class too large: __C62868Compiled

What do you think?
Any help with debugging or pointing out mistakes will be greatly appreciated! :slight_smile:

Will read carefully and see if I see anything to cause the errors.

As an update, there are two things in progress that will make this a better experience for you.

  1. Ndarray map2 by ammekk · Pull Request #10635 · hail-is/hail · GitHub This PR adds hl.nd.maximum and hl.nd.minimum, which will replace your hacky version of hl_maximum. It also generally adds a map2 method that lets you go over elements of two ndarrays at the same time to implement other operations that look like elementwise zips. Should get merged tomorrow.

  2. We are retooling hail error messages to be more useful. Soon if you make a mistake in hail, like ndarray shape errors or index out of bounds or something, you'll get a python stack trace that points you to the line of python that introduced the error, as opposed to the mostly unhelpful error summary you're getting now.


That map2 PR went in, so if you build latest hail from source you'll have maximum and minimum (we will probably also release 0.2.71 this week, which will include those). There is also my_nd.map2(other_nd, lambda a, b: ...) now, which will allow you to write the elementwise comparison more easily. The "tell me which Python line I made a mistake on" thing should happen in the next couple of days as well.
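
For example, your hl_maximum and hl_gte could then presumably be written along these lines (untested sketch, assuming the hl.nd.maximum and map2 described above):

import hail as hl

# Sketch, assuming the hl.nd.maximum and NDArrayExpression.map2 added in that PR.
def hl_maximum_new(nd_left, nd_right):
    return hl.nd.maximum(nd_left, nd_right)

def hl_gte_new(nd_left, nd_right):
    # elementwise (left >= right) as float, via map2
    return nd_left.map2(nd_right, lambda a, b: hl.float(a >= b))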

I don't yet see the mistake you're making that causes the shape errors. If you want to send me a python script, I can try running it with that new Python line mistake finder and tell you which line it is, or if you have other things you're working on, you can wait for it to get merged into mainline hail.

Awesome!
There's no rush on my side. The new features will probably be live before I sit down to it again.
But I'm still curious about your thoughts on the OutOfMemoryError: did I do something silly that caused Hail to build some infinite expression or something like that?

I'm admittedly not sure what's going on with the OutOfMemory. In the past when I've run into things like this it's been because of accidents with hail's lazy evaluation. Like I had a bad one where I was doing:

solved_thing = hl.solve(A, B)
hl.range(100).map(lambda x: solved_thing[x, x])

or something, and that will redo the solve 100 times, allocating every time. Those can be very tricky and hard to find. It's extra tricky because we do have some smarts in python to fix this, like if you use the same expression twice we automatically insert an hl.bind to prevent it from being computed twice, but one use inside a looping thing like range will get recomputed over and over. If you send me logs I might be able to decipher better what's going on.
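
One manual workaround in cases like that is to bind the result to a name yourself so it is only computed once, roughly like this (untested sketch with toy inputs):

import hail as hl

# Sketch: bind the solve result once with hl.rbind so the loop body
# reuses it, instead of re-deriving (and re-executing) the solve expression.
A = hl.nd.array([[2.0, 0.0], [0.0, 3.0]])
b = hl.nd.array([1.0, 2.0])

first_two = hl.rbind(
    hl.nd.solve(A, b),
    lambda solved: hl.range(2).map(lambda i: solved[i])
)
print(hl.eval(first_two))  # the solve is evaluated once, not once per element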

Paweł, do you have the full stack trace for the OOM? That might help.

Yup! :slight_smile:
running the code that I was referring to as "the first approach" above,
https://gist.github.com/olszewskip/fb6003e3e5c1248009f6506e102bf505#file-gistfile1-txt
I'm getting
hail-20210714-1858-0.2.69-6d2bd28a8849.log (4.5 KB)

Something insane is happening with that script; I haven't found the problem yet, but it's generating a tremendous amount of code for some reason. Looking into it.

For your second version that gives you the incompatible shapes error, I haven't managed to get such an error yet. So far when I try to run it I get issues about things being of type boolean unexpectedly. Hail currently isn't as good as numpy about converting booleans to doubles implicitly, so you have to do it explicitly.

It is curious that the second approach causes yet another kind of error for you; I have not seen that one.
Just for completeness:

2A) Running Hail_ordinal_reg_dummy_1.py · GitHub
produces ClassTooLargeException for me,
hail-log: hail-20210716-1452-0.2.69-6d2bd28a8849.log - Google Drive

2B) And running Hail_ordinal_reg_dummy_2.py · GitHub
produces HailException: Incompatible NDArrayshapes: [ 123 2 ] vs [ 0 0 ] for me,
hail-log hail-20210716-1524-0.2.69-6d2bd28a8849.log (3.2 MB)

If you want to share the exact thing you're running that leads to the wrong shape error, it would be very easy now for me to tell you where the mismatched dimensions are. If you built hail from source you'd also have the better error messages available now.

Thanks for your patience as we iron out the kinks in this.


Thank you! :smiley:
(I've edited my previous post with the "incompatible shapes" case)

Great, ok. So the incompatible shapes one, for me, is just showing a memory leak / segfault situation. It's possible that the (0, 0) shape you were seeing is just a result of reading from uninitialized memory or something. I'm going to look into all of these and see what I can figure out.
