Ok, so onto the ordinal case. Itās actually not too difficult to write down the derivatives of the log likelihood. Hereās one notation (apologies for introducing yet another slightly different bunch of indices):
i = 1,\ldots, n \quad \text{(samples)} \\
j = 1, \ldots, m \quad \text{(features)} \\
a = 1, \ldots, k-1 \quad \text{(thresholds)} \\
k>2 \\
X = \left[X_{ij}\right] \;,\quad X_{ij} \in \mathbb{R} \\
y = \left[y_i\right] \;,\quad y_i \in \{ 1,\ldots, k \} \\
\beta = \left[\beta_j \right] \;,\quad \beta_j \in \mathbb{R} \\
\theta = \left[ \theta_a \right] \;,\quad \theta_a \in \mathbb{R}
\;,\quad \forall_{a = 1,\ldots,k-2}\; (\theta_a \!<\! \theta_{a + 1}) \\
\eta = \left[\eta_a \right] \\
\eta_1 = \theta_1 \;,\quad \forall_{a=2,\ldots,k-1} \; (\eta_a \!=\! \theta_a \!-\! \theta_{a-1} )\\
\theta_a = \sum_{b=1}^{a} \eta_b \;,\quad \forall_{a=2, \ldots,k-1} \;(\eta_a \!>\! 0) \\
s = \left[ s_{ia}\right] \;,\quad s_{ia} = \left\{\begin{array}{ll}
\;1, & y_i \le a \\
-1, & y_i > a
\end{array}
\right. \\
t = \left[ t_{ab}\right] \;,\quad t_{ab} = \left\{\begin{array}{ll}
1, & a \ge b \\
0, & a < b
\end{array}
\right. \; \text{(lower triangle)} \\
\chi_{ia} = s_{ia}\left( \theta_a - \sum_{j}X_{ij}\beta_j \right)
= s_{ia} \left( \sum_b t_{ab} \eta_b - \sum_{j}X_{ij}\beta_j \right) \\
p_{ia} = \sigma\left(\chi_{ia}\right) \\
l = l(\eta, \beta, X, y) = -\sum_{i,a} \log p_{ia}
The above l is supposed to be the AT loss from before. So now some intermediate derivatives are
\sigma'(x) = \sigma(x)\left[1 - \sigma(x)\right] \\
\frac{\partial \chi_{ia}}{\partial \eta_b} = s_{ia} t_{ab} \\
\frac{\partial \chi_{ia}}{\partial \beta_j} = -s_{ia} X_{ij} \\
And finally
\frac{\partial l}{\partial \eta_b} = -\sum_{i,a} \left(1 - p_{ia}\right) s_{ia} t_{ab} \\
\frac{\partial l}{\partial \beta_j} = \sum_{i,a} \left(1 - p_{ia}\right) s_{ia} X_{ij} \\
\frac{\partial^2 l}{\partial \eta_b \partial \eta_c} = \sum_{i,a} p_{ia} \left(1 - p_{ia}\right) t_{ab}t_{ac} \\
\frac{\partial^2 l}{\partial \eta_b \partial \beta_j} =- \sum_{i,a} p_{ia} \left(1 - p_{ia}\right) t_{ab}X_{ij} \\
\frac{\partial^2 l}{\partial \beta_j \partial \beta_k} = \sum_{i,a} p_{ia} \left(1 - p_{ia}\right) X_{ij}X_{ik}
One last thing is the initial guess for the parameters. Drawing inspiration from the binary case, let it be:
\theta^0_a = \sigma^{-1} \left(\frac{\sum_i t_{a y_i}}{n}\right)\\
\beta^0_j = 0_j
I did, as before, generate some toy data from an arbitrary choice of the \eta,\,\beta parameters, and tried to infer them back by minimizing l, including in numpy
with the simple Newtonās method described previously. The punchline is that I got the same result from the handwritten Newton as from scipy.optimize.minimize
and from mord
, which hopefully suggests that the same could be done in hail
.
Hereās my adventures:
import numpy as np
rng = np.random.default_rng(seed=42)
import scipy
import mord
print(np.__version__, scipy.__version__, mord.__version__, sep=" | ")
# 1.20.2 | 1.6.2 | 0.6
def sigmoid(arr):
result = np.empty(arr.shape, dtype='float')
pos = arr > 0
neg = ~pos
result[pos] = 1 / (1 + np.exp(-arr[pos]))
exp_arr = np.exp(arr[neg])
result[neg] = exp_arr / (exp_arr + 1)
return result
k = 5
m = 4
n = 1234
_beta = np.array([0.1, 0.2, 1, 2])
_theta = np.array([10, 16, 17, 20])
X = rng.uniform(0, 8, (n, m))
_Xbeta = X @ _beta
_p = sigmoid(_theta[None, :] - _Xbeta[:, None] + 2 * rng.standard_normal(n)[:, None])
y = 1 + (_p < 0.5).sum(axis=1)
Some plots of the X, y
Translating the LaTeX to numpy
:
def theta(eta):
return np.cumsum(eta)
def s(y, k=k):
s = 2 * (y[:, None] <= np.arange(1, k)[None, :]) - 1
return s.astype('float')
def t(k=k):
# lower triagle, np.tril
t = np.arange(1, k)[:, None] >= np.arange(1, k)[None, :]
return t.astype('float')
def chi(eta, beta, X, y):
return s(y) * (
theta(eta)[None, :] - (X @ beta)[:, None]
)
def p(eta, beta, X, y):
return sigmoid(chi(eta, beta, X, y))
def L(eta_beta, X, y, k=k):
eta = eta_beta[:k-1]
beta = eta_beta[k-1:]
return -np.log(p(eta, beta, X, y)).sum()
def dLdeta(eta, beta, X, y):
_p = p(eta, beta, X, y)
_1 = ((1 - _p) * s(y)).sum(axis=0)
_2 = (_1[:, None] * t()).sum(axis=0)
return -_2
def dLdbeta(eta, beta, X, y):
_p = p(eta, beta, X, y)
_1 = ((1 - _p) * s(y)).sum(axis=1)
_2 = (_1[:, None] * X).sum(axis=0)
return _2
def dLdeta_beta(eta_beta, X, y, k=k):
eta = eta_beta[:k-1]
beta = eta_beta[k-1:]
return np.concatenate([
dLdeta(eta, beta, X, y),
dLdbeta(eta, beta, X, y)
])
def d2Ldeta2(eta, beta, X, y):
_p = p(eta, beta, X, y)
_1 = (_p * (1 - _p)).sum(axis=0)
_2 = t()[:, :, None] * t()[:, None, :]
_3 = (_1[:, None, None] * _2).sum(axis=0)
return _3
def d2Ldetadbeta(eta, beta, X, y):
_p = p(eta, beta, X, y)
_1 = _p * (1 - _p)
_2 = (_1[:, :, None] * t()[None, :, :]).sum(axis=1)
_3 = (_2[:, :, None] * X[:, None, :]).sum(axis=0)
return -_3
def d2Ldbeta2(eta, beta, X, y):
_p = p(eta, beta, X, y)
_1 = (_p * (1 - _p)).sum(axis=1)
_2 = _1[:, None, None] * X[:, :, None] * X[:, None, :]
return _2.sum(axis=0)
def d2Ldeta_beta2(eta_beta, X, y, k=k):
eta = eta_beta[0:k-1]
beta = eta_beta[k-1:]
_d2Ldeta2 = d2Ldeta2(eta, beta, X, y)
_d2Ldetadbeta = d2Ldetadbeta(eta, beta, X, y)
_d2Ldbeta2 = d2Ldbeta2(eta, beta, X, y)
return np.concatenate([
np.concatenate([
_d2Ldeta2, _d2Ldetadbeta
], axis=1),
np.concatenate([
_d2Ldetadbeta.T, _d2Ldbeta2
], axis=1)
], axis=0)
def logit(arr):
return np.log(arr / (1 - arr))
def get_eta_beta_0(y, k=k, m=m):
_t = (np.arange(1, k)[:, None] >= y[None, :]).astype('float')
_theta = logit(_t.mean(axis=1))
_eta = np.diff(_theta)
return np.concatenate([
np.array([_theta[0]]),
_eta,
np.zeros(m)
])
Solution from np.linalg.solve
:
def get_delta_eta_beta(eta_beta, X, y, k=k):
B = -dLdeta_beta(eta_beta, X, y)
A = d2Ldeta_beta2(eta_beta, X, y)
return np.linalg.solve(A, B)
def get_eta_beta(eta_beta, X, y, tol, k=k):
delta = get_delta_eta_beta(eta_beta, X, y, k=k)
eta_beta = eta_beta + delta
eta_idx = slice(1, k-1)
eta_beta[eta_idx] = np.maximum(eta_beta[eta_idx], 1e-5)
if np.max(np.abs(delta)) < tol:
return eta_beta
return get_eta_beta(eta_beta, X, y, tol)
eta_beta_0 = get_eta_beta_0(y)
eta_beta_numpy = get_eta_beta(eta_beta_0, X, y, 1e-6)
eta_numpy = eta_beta_numpy[0:k-1]
theta_numpy = theta(eta_numpy)
beta_numpy = eta_beta_numpy[k-1:]
(Notice the naive enforcement of the \eta_{2,\ldots} > 0 constraints via np.maximum
in the geta_eta_beta
.)
Solution from scipy.optimize.minimize
:
sol = scipy.optimize.minimize(
L,
get_eta_beta_0(y),
jac=dLdeta_beta,
hess=d2Ldeta_beta2,
args=(X, y),
bounds \
= [(None, None)] \
+ [(1e-5, None)]*(k-2) \
+ [(None, None)]*m,
tol=1e-6,
method="trust-exact"
)
eta_beta_scipy = sol.x
eta_scipy = eta_beta_scipy[0:k-1]
theta_scipy = theta(eta_scipy)
beta_scipy = eta_beta_scipy[k-1:]
(Donāt ask me what the trust-exact
method is. Itās supposed to be described in Nocedal, Wright, Numerical Optimization, 2nd ed.)
And a solution from mord
:
mordAT = mord.LogisticAT(alpha=0).fit(X, y)
theta_mord = mordAT.theta_
beta_mord = mordAT.coef_
The three solutions agree
assert(
np.allclose(theta_numpy, theta_scipy) & np.allclose(beta_numpy, beta_scipy) &\
np.allclose(theta_numpy, theta_mord) & np.allclose(beta_numpy, beta_mord) &\
np.allclose(theta_scipy, theta_mord) & np.allclose(beta_scipy, beta_mord)
)
Just if anyoneās curious, the found thresholds arenāt great, but the betas seem reasonably close:
print(*_theta, sep="\t")
print(*theta_numpy.round(2), sep="\t")
10 |
16 |
17 |
20 |
8.7 |
14.26 |
14.97 |
17.27 |
print(*_beta, sep="\t")
print(*beta_numpy.round(3), sep="\t")
0.1 |
0.2 |
1.0 |
2.0 |
0.046 |
0.175 |
0.884 |
1.767 |
Couple of things:
-
The scipy
and mord
methods call an underlying optimizer to take care about the \eta_{2,\ldots} > 0 constraints. I donāt know how reliable is the simple enforcement via elementwise-maximum. In the above example it doesnāt make any difference to completely comment it out in get_eta_beta
, so I suppose I got lucky and this time the trajectory did not cross into \eta_{2,\ldots} \le 0. Iāve also done some quick experiments with reparametrizing \eta_{2, \ldots} = \exp(\zeta_{2, \ldots}), but this seemed to have made the convergence much harder.
-
Iāve tried to naively rewrite the above numpy
approach in hail
, but Iām getting a lot of Java Heap Out Of Memory
errors even with tiny matrices, so I must be doing something terribly wrong. Iāll need a minute to experiment some more and come back with complaints If someone else can do the rewrite properly, please donāt hesitate to share.
-
What seems particularly bad in my approach above is the constant packing and unpacking of the i
and a
indexed arrays between each update.
-
To respond to the original kind remark by @johnc1231 about hl.nd
functionalities that might be missing: When trying to rewrite the above approach from numpy
to hail
Iāve found myself implementing silly versions of elementwise >=
, np.cumsum
, np.diff
, np.mean
, and np.maximum
:
def hl_gte(nd_left, nd_right):
# >=
return hl.map(
lambda _: hl.float(_>=0),
(nd_left - nd_right)
)
def hl_cumsum(nd):
nd = nd.reshape(-1)
return hl.nd.array(
hl.cumulative_sum(
nd._data_array()
)
)
def hl_diff(nd):
nd = nd.reshape(-1)
return nd[1:] - nd[:-1]
def hl_mean(nd, dim):
return nd.sum(axis=dim) / nd.shape[dim]
def hl_maximum(nd_left, nd_right):
return hl.nd.array(hl.zip(
nd_left._data_array(),
nd_right._data_array()
).map(
lambda _: hl.max(*_)
))
Any suggestions how to properly implement (or circumvent) those functinalities will be welcome.
- Iāve also found myself doing a lot of broadcasting, e.g.
array_1[:, :, None] * array_2[:, None, :]
. If there are some nuances related to that, e.g. about differences between using None
-indexing vs reshaping, Iād also love to hear about it.
Many thanks!