Tags: machine-learning, classification, logistic-regression, supervised-learning, log-likelihood

Optimizing weights in logistic regression (log likelihood)


In logistic regression, the hypothesis function is

                   h(x) = ( 1 + exp{-wx} )^-1

where w denotes the weights/parameters to be fit (optimized).


The log-likelihood (the negative of the cost function) for a single training example (x, y) is:

         l(w) = y * log ( h(x) ) + (1 - y) * log ( 1 - h(x) )

The goal is to maximize l(w) summed over all the training examples and thereby estimate w.
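As a concrete illustration, here is a minimal NumPy sketch (made-up data; the names X, y, w are just placeholders for this example) of the hypothesis and the log-likelihood summed over a small training set:

    import numpy as np

    def h(w, x):
        # Hypothesis: sigmoid of the linear score w.x
        return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

    def log_likelihood(w, X, y):
        # Sum of y*log(h(x)) + (1-y)*log(1-h(x)) over all training examples
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Made-up data: three examples with two features each
    X = np.array([[1.0, 2.0], [2.0, 0.5], [0.5, -1.0]])
    y = np.array([1, 1, 0])
    w = np.array([0.3, -0.2])
    print(log_likelihood(w, X, y))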


Question:

Consider the situation wherein there are many more positive (y=1) training examples than negative (y=0) training examples.

For simplicity:

if we consider only the positive (y=1) examples, the algorithm runs:

           maximize ( l(w) )

          =>  maximize ( y * log ( h(x) ) )

          =>  maximize ( log( h(x) ) )

          =>  maximize ( h(x) ); since log(z) increases with z

          =>  maximize ( ( 1 + exp{-wx} )^-1 )

          =>  maximize ( wx );   since a larger wx will increase h(x) and move it closer to 1

In other words, the optimization algorithm will try to increase (wx) so as to better fit the data and increase likelihood.


However, it seems possible that there is an unintended way for the algorithm to increase wx without improving the solution (decision boundary) in any way:

             by scaling w: w' = k*w  ( where k is a positive constant )

We can increase k*(wx) without changing our solution in any way.
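To see this numerically, here is a small sketch (hypothetical 1-D data, positive examples only, no bias term) showing that w' = k*w leaves the sign of wx, and hence the predicted class, unchanged, while the log-likelihood of the positive examples keeps climbing towards 0 as k grows:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    X = np.array([0.5, 1.0, 2.0])   # positive (y=1) examples only, 1-D
    w = 1.0

    for k in [1, 10, 100]:
        scores = k * w * X
        ll = np.sum(np.log(sigmoid(scores)))   # y=1, so l(w) = sum log h(x)
        print(k, np.sign(scores), ll)
    # The signs (predicted classes) never change, but the likelihood
    # keeps increasing towards 0 as k grows.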

1) Why is this not a problem? Or is this a problem?

2) One can argue that in a dataset with many more positive examples than negative ones, the algorithm will try to keep increasing ||w||.


Solution

    1. This is a problem sometimes, but it is solved by regularization
    2. Only if the classes are perfectly separated

    If there are only y=1 examples, the algorithm will indeed try to make wx as large as possible and never converge. But if you have only one class, you don't need logistic regression at all.

    If the dataset is imbalanced (there are many more y=1 than y=0 examples), logistic regression will, in general, suffer no convergence problems.

    Let's see why. Suppose you have only one negative example x_0 and N identical positive examples x_1. Then the log-likelihood looks like

    l(w) = N * log(h(x_1)) + log(1-h(x_0))
    

    h(x) is bounded between 0 and 1, so both terms are bounded above by 0 but unbounded from below.

    Now, if w is large enough and you keep increasing it, the first term increases only marginally (it is already close to 0), but the second term can decrease very fast, because 1 - h(x_0) approaches 0 and log(z) tends to minus infinity very fast as z approaches 0. If you increase w without limit, l(w) goes to minus infinity. Thus, there is a finite w that maximizes the likelihood.
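    A quick numerical check of this argument (a hypothetical non-separable 1-D setup: N = 100 identical positives at x_1 = 1.0 and a single negative at x_0 = 0.5, no bias term):

        import numpy as np

        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

        N, x1, x0 = 100, 1.0, 0.5    # many positives, one "misplaced" negative

        def l(w):
            # l(w) = N*log(h(x_1)) + log(1 - h(x_0))
            return N * np.log(sigmoid(w * x1)) + np.log(1.0 - sigmoid(w * x0))

        ws = np.linspace(0.1, 50, 500)
        vals = np.array([l(w) for w in ws])
        print("best w ~", ws[vals.argmax()])   # a finite w, not the largest one
        print(l(5.0), l(50.0))                 # the likelihood eventually decreases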

    But there is one important exception. It happens when the classes are perfectly separated by some hyperplane (it has little to do with class sizes). In this case, both the first and the second term will tend to 0 while ||w|| tends to infinity.

    But if the classes are perfectly separated, you probably don't need logistic regression at all! Its power lies in probabilistic prediction, but in the perfectly separated case the prediction can just as well be deterministic. So you could apply, say, an SVM to your data instead.

    Or you can solve a regularized problem, maximizing l(w) - lambda*||w||. For example, scikit-learn's logistic regression does exactly this. In this case, once l(w) is sufficiently close to 0, the penalty term ||w|| dominates, and the objective function eventually decreases as ||w|| grows.
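    Here is a minimal scikit-learn sketch (toy, perfectly separable 1-D data; LogisticRegression uses an L2 penalty by default, with strength controlled by C = 1/lambda) showing that the fitted coefficients stay finite for any fixed penalty:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Toy, perfectly separable 1-D data
        X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
        y = np.array([0, 0, 1, 1])

        for C in [0.1, 1.0, 100.0]:            # smaller C = stronger penalty
            clf = LogisticRegression(C=C).fit(X, y)
            print(C, clf.coef_.ravel(), clf.intercept_)
        # ||w|| grows as the penalty weakens, but stays finite for any fixed C.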

    Thus, a small penalty in the objective function resolves your worries. This is a widely applied solution, not only in logistic regression but also in other linear models (Lasso, Ridge, etc.) and in neural networks.