machine-learning, mathematical-optimization, linear-regression

Convergence and regularization in linear regression classifier


I am trying to implement a binary classifier using logistic regression for data drawn from two point sets (classes y ∈ {-1, 1}). As seen below, we can use the parameter a to prevent overfitting.

[image: target function]

Now I am not sure how to choose a "good" value for a. Another thing I am not sure about is how to choose a "good" convergence criterion for this sort of problem.


Solution

  • Value of 'a'

    Choosing "good" things is a sort of meta-regression: pick any value for a that seems reasonable. Run the regression. Try again with values of a larger and smaller by a factor of 3. If either works better than the original, try another factor of 3 in that direction -- but round it from 9x to 10x for readability.

    You get the idea ... play with it until you get in the right range. Unless you're really trying to optimize the result, you probably won't need to narrow it down much closer than that factor of 3.
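
    As a sketch, that multiplicative search could look like the following. Here train_and_score is a hypothetical helper that fits the classifier with a given a and returns its validation error; the stopping rule is simply "keep moving by 3x while the error improves":

        def coarse_search(train_and_score, a0=1.0):
            # Multiplicative search over the regularization weight a.
            # train_and_score(a) is assumed to fit the model and return
            # a validation error (lower is better).
            best_a, best_err = a0, train_and_score(a0)
            for factor in (3.0, 1.0 / 3.0):   # walk up by 3x, then down by 3x
                a = best_a * factor
                while True:
                    err = train_and_score(a)
                    if err >= best_err:       # stopped improving; turn around
                        break
                    best_a, best_err = a, err
                    a *= factor
            return best_a, best_err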

    Data Set Partition

    ML folks have spent a lot of words analysing the best split. The optimal split depends very much on your data space. As a global heuristic, use half or a bit more for training; of the rest, no more than half should be used for testing, the rest for validation. For instance, 50:20:30 is a viable approximation for train:test:validate.

    Again, you get to play with this somewhat ... except that any true test of the error rate would be entirely new data.
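
    A minimal sketch of that 50:20:30 partition with NumPy (the ratios are just the heuristic above; shuffle first so the split is random):

        import numpy as np

        def split_50_20_30(X, y, seed=0):
            # Shuffle, then slice into train/test/validate = 50:20:30.
            rng = np.random.default_rng(seed)
            idx = rng.permutation(len(X))
            n_train = int(0.5 * len(X))
            n_test = int(0.2 * len(X))
            train = idx[:n_train]
            test = idx[n_train:n_train + n_test]
            validate = idx[n_train + n_test:]
            return ((X[train], y[train]),
                    (X[test], y[test]),
                    (X[validate], y[validate]))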

    Convergence

    This depends very much on the characteristics of your empirical error space near the best solution, as well as near local regions of low gradient.

    The first consideration is to choose an error function that is likely to be convex and have no flattish regions. The second is to get some feeling for the magnitude of the gradient in the region of a desired solution (normalizing your data will help with this); use this to help choose the convergence radius; you might want to play with that 3x scaling here, too. The final one is to play with the learning rate, so that it's scaled to the normalized data.
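
    Putting those pieces together, here is a minimal gradient-descent sketch. The loss form is an assumption, since the question's target-function image did not survive: mean logistic loss over labels y in {-1, +1} plus a * ||w||^2. Convergence is declared when the gradient norm falls inside a chosen radius tol:

        import numpy as np

        def fit_logistic(X, y, a=0.1, lr=0.1, tol=1e-5, max_iter=10000):
            # Gradient descent on an assumed L2-regularized logistic loss:
            #   E(w) = mean(log(1 + exp(-y * (X @ w)))) + a * ||w||^2
            # X should be normalized; y must be in {-1, +1}.
            w = np.zeros(X.shape[1])
            for _ in range(max_iter):
                margins = y * (X @ w)
                # d/dw of log(1 + exp(-m)) is -y * x * sigmoid(-m)
                sig = 1.0 / (1.0 + np.exp(margins))       # sigmoid(-margin)
                grad = -(X * (y * sig)[:, None]).mean(axis=0) + 2 * a * w
                if np.linalg.norm(grad) < tol:            # convergence radius
                    break
                w -= lr * grad
            return w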

    Does any of this help?