Hi!

I haven't yet made any progress with the ordinal regression, but I haven't forgotten about it.

Just as a small sanity check, and as an unambitious proof to myself that I understand something about the code that @johnc1231 has linked to, I'll try to summarize what I think is happening there with the binary logistic regression. I hope it's okay that I'll use this space as my scratchpad and the proverbial rubber duck. Hopefully, writing this down will make it easier, for me or anyone else, to generalize it to the ordinal regression.

So, basically, one looks for the minimum of the negative log-likelihood function by looking for the root of its first derivative. The latter is done with Newton's method, much as described in the documentation of `hl.experimental.loop` (Hail | Experimental), only in a multi-variable setting.
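
For intuition, here is the same idea in one dimension, a minimal sketch of my own (the function and names below are made up for illustration, not from the linked code): to minimize f, find a root of g = f' via the update x_{n+1} = x_n - g(x_n)/g'(x_n).

```python
# 1-D Newton's method: minimize f by finding a root of f'.
# Toy example: f(x) = x**4 - 3*x**2 + x, so f'(x) = 4x^3 - 6x + 1.
def newton_min(df, d2f, x0, tol=1e-10, max_iter=100):
    x = x0
    for _ in range(max_iter):
        delta = -df(x) / d2f(x)  # 1-D analogue of solving A @ delta = B
        x += delta
        if abs(delta) < tol:
            break
    return x

df = lambda x: 4 * x**3 - 6 * x + 1   # f'
d2f = lambda x: 12 * x**2 - 6         # f''
x_star = newton_min(df, d2f, x0=1.5)  # converges to a local minimum near 1.13
```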

Some notation for data and the model:

i = 1,\ldots, n \quad \text{(observations)} \\
j = 0,\ldots, m - 1 \quad\text{(features, including the intercept)} \\
X = \left[ X_{ij} \right] \;,\quad
X_{ij} \in \mathbb{R} \;,\\
\forall_i\; X_{i0} = 1 \quad \text{(the intercept)} \\
\beta = \left[ \beta_j \right] \\
y = \left[ y_i \right] \;,\quad
y_i \in \{0, 1\} \;,\\
s = \left[ s_i \right] \;,\quad
s_i = 2 y_i - 1 = \left\{\begin{array}{ll}
-1, & y_i = 0 \\
1, & y_i = 1
\end{array}
\right. \\

Sigmoid, logit, binary probabilities and the total negative log-likelihood l:

\sigma(x) = \frac{1}{1 + \exp(-x)} \;,\quad
\sigma^{-1}(x) = \log \frac{x}{1 - x} \\
p^{+}_i = p^{+}_i(\beta, X) = \sigma\left(\sum_{j} X_{ij} \beta_j \right) \\
p_i = p_i(\beta, X, y) = \sigma\left(s_i \sum_{j} X_{ij} \beta_j \right) \\
l(\beta, X, y) = - \sum_i \left[ y_i \log p^{+}_i + (1 - y_i) \log(1 - p^{+}_i) \right]
= - \sum_i \log p_i
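
As a quick check of my own that the two expressions for l really agree (the data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
X = rng.standard_normal((n, m))
beta = rng.standard_normal(m)
y = rng.integers(0, 2, size=n)
s = 2 * y - 1

p_plus = 1 / (1 + np.exp(-(X @ beta)))   # sigma(X @ beta)
p = 1 / (1 + np.exp(-(s * (X @ beta))))  # sigma(s_i * X @ beta)

# Bernoulli form vs. the compact form of the negative log-likelihood
l_bernoulli = -np.sum(y * np.log(p_plus) + (1 - y) * np.log(1 - p_plus))
l_compact = -np.sum(np.log(p))
assert np.allclose(l_bernoulli, l_compact)
```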

We want to find \beta = \beta^{\infty} such that \frac{\partial l}{\partial \beta} \Big|_{\beta^{\infty}} \!\!= 0. We will recursively update a \beta^n vector starting from an initial guess \beta^0, producing a sequence (\beta^n) until convergence, defined as \|\beta^{n} - \beta^{n - 1}\|_\infty < \text{tol} (with some arbitrarily small \text{tol}); this is the criterion the code checks with `np.max(np.abs(delta))`. The updating is done with the following rationale:

0 \approx \frac{\partial l}{\partial \beta_j} \bigg|_{\beta^{n+1}} =
\frac{\partial l}{\partial \beta_j}\bigg|_{\beta^n} +
\sum_{k=0}^{m-1} \frac{\partial^2 l}{\partial\beta_j \partial\beta_k} \bigg|_{\beta^n} \delta\beta^n_k + \ldots
\;,\quad \text{where} \quad \delta\beta^n_k = \beta^{n+1}_k - \beta^n_k \,, \\

so we approximate \delta\beta^n by solving the linear system

A^n \cdot \delta\beta^n = B^n \,,\;
\text{with} \\
A^n = \left[ A^n_{jk}\right] = \left[\frac{\partial^2 l}{\partial\beta_j \partial\beta_k} \right]\bigg|_{\beta^n} \\
B^n = \left[B^n_j\right] = \left[ - \frac{\partial l}{\partial \beta_j} \right] \bigg|_{\beta^n}

and update the coefficients via \beta^{n+1} = \beta^n + \delta\beta^n.

The derivatives of the loss are

-\frac{\partial l}{\partial \beta_j} = \sum_i s_i X_{ij} ( 1- p_i) = \sum_i X_{ij} (y_i - p^{+}_i) \\
\frac{\partial^2l}{\partial\beta_j \partial \beta_k} = \sum_i X_{ij} X_{ik} p_i (1 - p_i) = \sum_i X_{ij} X_{ik} p^{+}_i (1 - p^{+}_i)
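
To convince myself these closed forms are right, here's a finite-difference check of my own on small random data (everything below is illustrative, not part of @johnc1231's code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 40, 3
X = rng.standard_normal((n, m))
beta = rng.standard_normal(m)
y = rng.integers(0, 2, size=n)

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def loss(b):
    p = sigmoid(X @ b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Closed-form gradient and Hessian from the formulas above
grad = X.T @ (sigmoid(X @ beta) - y)
p = sigmoid(X @ beta)
hess = (X * (p * (1 - p))[:, None]).T @ X

# Central differences of the loss
eps = 1e-6
fd_grad = np.array([
    (loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
    for e in np.eye(m)
])
assert np.allclose(grad, fd_grad, atol=1e-4)
assert np.allclose(hess, hess.T)  # Hessian should be symmetric
```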

And a reasonable ansatz for \beta^0 is

\beta^0 = \left[\sigma^{-1}\!\!\left(\frac{\sum_i y_i}{n}\right), 0, \ldots, 0 \right]

which has the advantage that it satisfies \frac{\partial l}{\partial \beta_0} \Big|_{\beta^0} \!\!= 0, in light of the first column of X being all ones, X_{i0} = 1.
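
A quick numerical check of my own of that property, on illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 4
X = np.concatenate([np.ones((n, 1)), rng.standard_normal((n, m - 1))], axis=1)
y = rng.integers(0, 2, size=n)

# The ansatz: logit of the base rate in the intercept slot, zeros elsewhere
beta0 = np.zeros(m)
beta0[0] = np.log(y.mean() / (1 - y.mean()))

# With beta0, X @ beta0 is constant, so p+ is y.mean() for every observation,
# and the intercept component of the gradient sums to (essentially) zero.
p_plus = 1 / (1 + np.exp(-(X @ beta0)))
grad = X.T @ (p_plus - y)
assert abs(grad[0]) < 1e-10
```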

The code by @johnc1231 has those derivatives and the ansatz explicitly typed in.

I'll generate some toy data in Python and try to infer the betas using `np.linalg.solve`, `scipy.optimize.minimize` and `hl.nd.solve`, to see whether I get the same result in the three cases.

```
import numpy as np
rng = np.random.default_rng(seed=42)
import scipy
import hail as hl
hl.init()
print(np.__version__, scipy.__version__, hl.__version__, sep=" | ")
```

`1.20.2 | 1.6.2 | 0.2.69-6d2bd28a8849`

```
def sigmoid(arr):
    result = np.empty(arr.shape, dtype='float')
    positive = arr > 0
    negative = ~positive
    result[positive] = 1 / (1 + np.exp(-arr[positive]))
    exp_arr = np.exp(arr[negative])
    result[negative] = exp_arr / (exp_arr + 1)
    return result

m = 5
n = 1234
_beta = np.array([-0.5, 0.1, 0.2, 1, 2])
X = np.concatenate([
    np.ones((n, 1)),
    rng.standard_normal((n, m - 1))
], axis=1)
_Xbeta = X @ _beta
_p = sigmoid(_Xbeta)
y = rng.binomial(1, _p)
```

Here's what this toy data looks like (plot not shown).

Some definitions and the solution from numpy:

```
def logit(arr):
    return np.log(arr / (1 - arr))

def L(beta, X, y):
    s = 2 * y - 1
    sXbeta = s * (X @ beta)
    result = np.empty(sXbeta.shape, dtype='float')
    positive = sXbeta > 0
    result[positive] = np.log(1 + np.exp(-sXbeta[positive]))
    result[~positive] = -sXbeta[~positive] + np.log(1 + np.exp(sXbeta[~positive]))
    return result.sum()

def dLdbeta(beta, X, y):
    p = sigmoid(X @ beta)
    return X.T @ (p - y)

def d2Ldbeta2(beta, X, y=None):
    p = sigmoid(X @ beta)
    p1mp = p * (1 - p)
    XXp1mp = X[:, :, None] * X[:, None, :] * p1mp[:, None, None]
    return XXp1mp.sum(axis=0)

def get_beta_0(y, m=m):
    beta_0 = np.zeros(m, dtype='float')
    beta_0[0] = logit(y.mean())
    return beta_0

def delta_beta(beta, X, y):
    B = -dLdbeta(beta, X, y)
    A = d2Ldbeta2(beta, X)
    return np.linalg.solve(A, B)

def get_beta(beta, X, y, tol):
    delta = delta_beta(beta, X, y)
    beta = beta + delta
    if np.max(np.abs(delta)) < tol:
        return beta
    return get_beta(beta, X, y, tol)

beta_0 = get_beta_0(y)
beta_numpy = get_beta(beta_0, X, y, tol=1e-6)
```

Minimizing `L` with `scipy`:

```
sol = scipy.optimize.minimize(
    L,
    get_beta_0(y),
    jac=dLdbeta,
    hess=d2Ldbeta2,
    args=(X, y),
    tol=1e-6,
    method="Newton-CG"
)
beta_scipy = sol.x
```

And finally, doing the same thing as in numpy, but with `hail.nd`:

```
def hl_scalar_sigmoid(x):
    return hl.if_else(
        x > 0,
        1 / (1 + hl.exp(-x)),
        hl.rbind(
            hl.exp(x),
            lambda _: _ / (1 + _)
        )
    )

def hl_sigmoid(hl_nd):
    return hl_nd.map(hl_scalar_sigmoid)

def hl_dLdbeta(beta, X, y):
    p = hl_sigmoid(X @ beta)
    return X.T @ (p - y)

def hl_d2Ldbeta2(beta, X, y=None):
    p = hl_sigmoid(X @ beta)
    p1mp = p * (1 - p)
    XXp1mp = X[:, :, None] * X[:, None, :] * p1mp[:, None, None]
    return XXp1mp.sum(axis=0)

def hl_get_beta_0(hl_y, m=m):
    y_mean = hl.mean(
        hl_y._data_array()
    )
    beta_00 = hl.log(y_mean / (1 - y_mean))
    return hl.nd.hstack([
        hl.nd.array([beta_00]),
        hl.nd.zeros(
            m - 1,
            dtype=beta_00.dtype
        )
    ])

def hl_delta_beta(beta, X, y):
    B = -hl_dLdbeta(beta, X, y)
    A = hl_d2Ldbeta2(beta, X)
    return hl.nd.solve(A, B)

def _hl_get_beta(_, beta, X, y, tol):
    delta = hl_delta_beta(beta, X, y)
    beta = beta + delta
    converged = hl.max(
        hl.abs(delta)._data_array()
    ) < tol
    return hl.if_else(
        converged,
        beta,
        _(beta, X, y, tol)
    )

def hl_get_beta(beta, X, y, tol):
    return hl.experimental.loop(
        _hl_get_beta,
        hl.dtype('ndarray<float64, 1>'),
        beta, X, y, tol
    )

hl_X = hl.nd.array(X)
hl_y = hl.nd.array(y)
hl_beta_0 = hl_get_beta_0(hl_y)
beta_hail = hl_get_beta(hl_beta_0, hl_X, hl_y, 1e-6).collect()[0]
```

And it works!

```
assert (
    np.allclose(beta_numpy, beta_scipy)
    and np.allclose(beta_numpy, beta_hail)
    and np.allclose(beta_scipy, beta_hail)
)
```

So, other than stripping away the (arguably more important) part that parallelizes the regression across the rows of a MatrixTable, uses a prefit null model and computes p-values, this is what I believe is also happening in @johnc1231's code.

Again, let me know if I've said something false or dumb, or if I seem to be misunderstanding Hail's interface! Thanks!