I have to implement a function to compute the Hessian of the logistic loss, written as a sum of logarithms of exponential terms. I have implemented the following function in Python:
def hessian(self, w, hess_trick=0):
    hess = 0
    # Accumulate one weighted outer product x_i x_i^T per sample.
    for x_i, y_i in zip(self.data, self.labels):
        hess += np.exp(y_i * np.dot(w.T, x_i)) / ((1 + np.exp(y_i * np.dot(w.T, x_i))) ** 2) * np.outer(x_i, x_i.T)
    return hess + lambda_reg * np.identity(w.shape[0]) + hess_trick * 10 ** (-12) * np.identity(w.shape[0])
My question is: how can I code an equivalent but faster function, without using the slow Python for loop?
Since I am not so confident with NumPy, I tried to code the following function:
def new_hessian(self, w, hess_trick=0):
    exp_term = np.exp(self.labels * np.dot(self.data, w))
    sigmoid_term = 1 + exp_term
    inv_sigmoid_sq = 1 / sigmoid_term ** 2
    diag_elements = np.sum((exp_term * inv_sigmoid_sq)[:, np.newaxis] * self.data ** 2, axis=0)
    off_diag_elements = np.dot((exp_term * inv_sigmoid_sq) * self.data.T, self.data)
    hess = np.diag(diag_elements) + off_diag_elements
    regularization = lambda_reg * np.identity(w.shape[0])
    hess += hess_trick * 1e-12 * np.identity(w.shape[0])
    return hess + regularization
By debugging this function, I saw that there is a fundamental problem. For small numbers of features (say, fewer than 200), the two implementations of the Hessian are not equal. When I increase the number of features, the two functions seem to be equal. The problem is that when I test these implementations with Newton's method to optimize the log loss, the faster implementation converges in more iterations than the first (slower in runtime) implementation.
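For anyone who wants to reproduce the comparison, a self-contained harness on random data could look roughly like this (a sketch, not my actual class: the standalone functions below take the data, labels, and lambda_reg explicitly, and the vectorized candidate uses a simple row-weighting form):

import numpy as np

def hessian_loop(X, y, w, lambda_reg=1.0):
    # Reference implementation: one weighted outer product per sample.
    hess = 0
    for x_i, y_i in zip(X, y):
        e = np.exp(y_i * np.dot(w, x_i))
        hess += e / (1 + e) ** 2 * np.outer(x_i, x_i)
    return hess + lambda_reg * np.identity(w.shape[0])

def hessian_vec(X, y, w, lambda_reg=1.0):
    # Vectorized candidate: scale each row of X, then a single matrix product.
    e = np.exp(y * (X @ w))
    s = e / (1 + e) ** 2
    return X.T @ (s[:, None] * X) + lambda_reg * np.identity(w.shape[0])

rng = np.random.default_rng(0)
n, d = 400, 30
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.normal(size=d)

print(np.abs(hessian_loop(X, y, w) - hessian_vec(X, y, w)).max())  # ~1e-12 expected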
Here is a vectorized version of the same:
a = np.exp(y * (X @ w))
## Method 1
b = np.sqrt(a)*X.T/(1 + a)
b @ b.T
## method 2
X.T @ np.diag(a/(1 + a)**2) @ X
## method 3
np.einsum('ij,i,ik',X, a/(1 + a)**2, X)
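All three methods should produce the same d x d matrix; a quick check on random data (a sketch, assuming X is (n, d) and y has entries in {-1, +1}):

import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 20
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.normal(size=d)

a = np.exp(y * (X @ w))
b = np.sqrt(a) * X.T / (1 + a)
h1 = b @ b.T                                        # method 1
h2 = X.T @ np.diag(a / (1 + a) ** 2) @ X            # method 2 (builds an n x n diagonal)
h3 = np.einsum('ij,i,ik', X, a / (1 + a) ** 2, X)   # method 3

print(np.allclose(h1, h2), np.allclose(h1, h3))  # True True

Note that method 2 materializes an n x n diagonal matrix, so methods 1 and 3 are the memory-friendly choices when the number of samples is large.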
The formula given in the link you provided is incorrect. Note that the second derivative of the log-likelihood with respect to the weights does not involve the labels, so you do not need them: with labels in {-1, +1}, each per-sample weight exp(y_i * w.x_i)/(1 + exp(y_i * w.x_i))^2 is unchanged when y_i flips sign, since exp(-z)/(1 + exp(-z))^2 = exp(z)/(1 + exp(z))^2. Here is a link to the correct Hessian for logistic regression.
Now, with this correct formula, we can compute the Hessian as:
a = np.exp(X @ w)
- X.T @ np.diag(a/(1 + a)**2) @ X
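As a quick numerical sanity check (a sketch on random data), the label-free weights a/(1 + a)^2 with a = exp(X @ w) coincide with the label-dependent weights used above, which is exactly why the labels can be dropped:

import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.normal(size=d)

a_y = np.exp(y * (X @ w))   # label-dependent weights
a = np.exp(X @ w)           # label-free weights
print(np.allclose(a_y / (1 + a_y) ** 2, a / (1 + a) ** 2))  # True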
Real Example:
import pandas as pd, numpy as np
dat = pd.read_csv("https://raw.githubusercontent.com/oonyambu/ECON430/main/admit.csv", index_col = 0)
X = np.c_[np.ones(dat.shape[0]), dat.iloc[:,1:3].to_numpy()]
y = dat.admit.to_numpy()
from statsmodels.formula.api import logit
mod = logit("admit~gre+gpa", dat).fit()
# Hessian = -inverse(covariance)
np.linalg.inv(-mod.normalized_cov_params)
array([[-8.25030780e+01, -4.98883263e+04, -2.84169634e+02],
       [-4.98883263e+04, -3.11768300e+07, -1.72965203e+05],
       [-2.84169634e+02, -1.72965203e+05, -9.89840314e+02]])
mod.model.hessian(mod.params)
array([[-8.25030780e+01, -4.98883263e+04, -2.84169634e+02],
       [-4.98883263e+04, -3.11768300e+07, -1.72965203e+05],
       [-2.84169634e+02, -1.72965203e+05, -9.89840314e+02]])
## Direct Computation. (No y)
a = np.exp(X @ mod.params)
b = np.sqrt(a)*X.T/(1 + a)
-b @ b.T
array([[-8.25030780e+01, -4.98883263e+04, -2.84169634e+02],
       [-4.98883263e+04, -3.11768300e+07, -1.72965203e+05],
       [-2.84169634e+02, -1.72965203e+05, -9.89840314e+02]])
or even
k = 1/(1 + np.exp(-X @ mod.params))
-k*(1-k)*X.T @ X
array([[-8.25030780e+01, -4.98883263e+04, -2.84169634e+02],
       [-4.98883263e+04, -3.11768300e+07, -1.72965203e+05],
       [-2.84169634e+02, -1.72965203e+05, -9.89840314e+02]])
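Plugged back into the original class, the vectorized method might look like this (a sketch only, keeping the sign convention, regularization, and hess_trick term from the question; lambda_reg is assumed to be defined as in the original code):

def new_hessian(self, w, hess_trick=0):
    # k_i = sigma(w.x_i); k_i * (1 - k_i) equals exp(w.x_i) / (1 + exp(w.x_i))^2.
    k = 1 / (1 + np.exp(-self.data @ w))
    hess = (k * (1 - k) * self.data.T) @ self.data
    return hess + lambda_reg * np.identity(w.shape[0]) + hess_trick * 1e-12 * np.identity(w.shape[0])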