
MLE Log-likelihood for logistic regression gives divide by zero error


I want to compute the log-likelihood of a logistic regression model.

import numpy as np

def sigma(x):
    return 1 / (1 + np.exp(-x))

def logll(y, X, w):
    """"
    Parameters
    y : ndarray of shape (N,)
        Binary labels (either 0 or 1).
    X : ndarray of shape (N,D)
        Design matrix.
    w : ndarray of shape (D,)
        Weight vector.
    """
    p = sigma(X @ w)
    y_1 = y @ np.log(p)
    y_0 = (1 - y) @ (1 - np.log(1 - p))
    return y_1 + y_0

logll(y, Xz, np.linspace(-5,5,D))

Applying this function results in

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:16: 
RuntimeWarning: divide by zero encountered in log
  app.launch_new_instance()

I would expect y_0 to be a negative float. How can I avoid this error, and is there a bug somewhere in the code?

Edit 1

X @ w statistics:
Max: 550.775133944
Min: -141.972597608
Sigma(max): 1.0 => triggers the divide-by-zero in y_0, since np.log(1 - 1.0) = np.log(0.0)
Sigma(min): 2.19828642169e-62

Edit 2

I also have access to this logsigma function that computes sigma in log space:

def logsigma(x):
    return np.vectorize(np.log)(sigma(x))

Unfortunately, I can't find a way to rewrite y_0 with it. The following is my attempt, but it is obviously not correct.

def l(y, X, w):
    y_1 = np.dot(y, logsigma(X @ w))
    y_0 = (1 - y) @ (1 - np.log(1 - logsigma(X @ w)))
    return y_1 + y_0
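For reference, one standard identity that makes a fully log-space rewrite possible is log(1 - sigma(z)) = log(sigma(-z)) = -log(1 + exp(z)), which NumPy exposes in stable form as np.logaddexp(0, z). A minimal sketch along those lines (the helper name stable_logll is only illustrative, not part of the original code):

import numpy as np

def stable_logll(y, X, w):
    # log(sigma(z))     = -log(1 + exp(-z)) = -np.logaddexp(0, -z)
    # log(1 - sigma(z)) = -log(1 + exp(z))  = -np.logaddexp(0, z)
    z = X @ w
    return y @ (-np.logaddexp(0, -z)) + (1 - y) @ (-np.logaddexp(0, z))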

Solution

  • First of all, I think you've made a mistake in your log-likelihood formula: it should be a plain sum of y_0 and y_1, not a sum of exponentials:

    log L(w) = Σᵢ [ yᵢ · log σ(xᵢᵀ w) + (1 − yᵢ) · log(1 − σ(xᵢᵀ w)) ]
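    In code terms, this means the y_0 term should apply np.log(1 - p) directly, without the extra "1 -" around the log. A sketch of the corrected function, keeping the names from the question:

    import numpy as np

    def logll(y, X, w):
        # Plain sum of the two terms: y*log(p) + (1-y)*log(1-p), summed over samples.
        p = 1 / (1 + np.exp(-(X @ w)))
        y_1 = y @ np.log(p)
        y_0 = (1 - y) @ np.log(1 - p)
        return y_1 + y_0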

    Division by zero can be caused by large negative values (large in absolute value) in X @ w: e.g. sigma(-800) is exactly 0.0 on my machine, so taking its log produces "RuntimeWarning: divide by zero encountered in log".
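    A quick way to see this underflow (the exact cutoff depends on float64 precision, so treat the numbers as indicative):

    import numpy as np

    z = -800.0
    p = 1 / (1 + np.exp(-z))   # np.exp(800) overflows to inf, so p becomes exactly 0.0
    print(p)                   # 0.0
    print(np.log(p))           # -inf, with "divide by zero encountered in log"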

    Make sure you initialize your network with small values near zero, and that you don't get exploding gradients after several iterations of backprop.

    By the way, here's the code I use for cross-entropy loss, which also works for multi-class problems:

    import numpy as np

    def softmax_loss(x, y):
      """
      Inputs:
      - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
        for the ith input.
      - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
        0 <= y[i] < C

      Returns the average cross-entropy loss over the N inputs.
      """
      # Shift by the row-wise max before exponentiating for numerical stability.
      probs = np.exp(x - np.max(x, axis=1, keepdims=True))
      probs /= np.sum(probs, axis=1, keepdims=True)
      N = x.shape[0]
      return -np.sum(np.log(probs[np.arange(N), y])) / N
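    For example, with a hypothetical toy batch (the scores and labels below are made up just to show the expected shapes):

    scores = np.array([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])   # (N, C) class scores
    labels = np.array([0, 1])              # (N,) integer labels, 0 <= labels[i] < C
    print(softmax_loss(scores, labels))    # average cross-entropy over the batch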
    

    UPD: When nothing else helps, there is one more numerical trick (discussed in the comments): compute log(p+epsilon) and log(1-p+epsilon) with a small positive epsilon value. This ensures that log(0.0) never happens.
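    A minimal sketch of that trick, assuming the same y, X, w shapes as in the question (the function name logll_eps and the particular eps value are only illustrative):

    import numpy as np

    def logll_eps(y, X, w, eps=1e-15):
        # Adding a small epsilon keeps the arguments of np.log strictly positive,
        # so log(0.0) can never occur even when sigma saturates at 0.0 or 1.0.
        p = 1 / (1 + np.exp(-(X @ w)))
        return y @ np.log(p + eps) + (1 - y) @ np.log(1 - p + eps)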