machine-learning, linear-regression

How does having smaller values for parameters help in preventing over-fitting?


To reduce the problem of over-fitting in linear regression in machine learning, it is suggested to modify the cost function by including the squares of the parameters. This results in smaller values of the parameters.

This is not at all intuitive to me. How can having smaller values for the parameters result in a simpler hypothesis and help prevent over-fitting?
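
For concreteness, the suggested modification looks roughly like this (a minimal NumPy sketch rather than code from any particular course or library; the names and the value of lam are placeholders):

    import numpy as np

    def regularised_cost(theta, X, y, lam=1.0):
        """Squared-error cost plus the sum of squared parameters."""
        m = len(y)
        residuals = X @ theta - y
        data_term = (residuals @ residuals) / (2 * m)
        penalty = lam * np.sum(theta[1:] ** 2)  # the intercept theta[0] is usually left out of the penalty
        return data_term + penalty

Because the penalty grows with the magnitude of the parameters, minimising the total cost pushes them towards zero.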


Solution

  • This is a bit more complicated. It depends very much on the algorithm you are using.

    To give an easy but slightly contrived example: instead of optimising the parameters of the function

      y = a*x1 + b*x2 
    

    you could also optimise the parameters of

      y = 1/a * x1 + 1/b * x2 
    

    Obviously, if you minimise the parameters in the former case, you need to maximise them in the latter case.
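
    As a throwaway numerical illustration of that point (the numbers are made up), the same predictor can be written with small or with large parameter values:

      # hypothetical coefficients for y = a*x1 + b*x2
      a, b = 0.1, 0.2

      # the identical predictor written as y = (1/a2)*x1 + (1/b2)*x2
      a2, b2 = 1 / a, 1 / b   # 10.0 and 5.0

      x1, x2 = 3.0, 4.0
      print(a * x1 + b * x2)                  # same prediction either way,
      print((1 / a2) * x1 + (1 / b2) * x2)    # but very different parameter magnitudes

    So "smaller parameters" only has a meaning relative to a fixed parametrisation.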

    The justification for why minimising the squares of the parameters works for most algorithms comes from computational learning theory.

    Let's assume for the following that you want to learn a function

      f(x) = a + b*x + c*x^2 + d*x^3 + ...
    

    One can argue that a function where only a is different from zero is more likely than a function where a and b are different from zero, and so on. Following Occam's razor (if you have two hypotheses explaining your data, the simpler one is more likely to be right), you should prefer a hypothesis where more of your parameters are zero.

    To give an example, let's say your data points are (x,y) = {(-1,0),(1,0)}. Which function would you prefer:

      f(x) = 0

    or

      f(x) = -1 + 1*x^2
    

    Extending this a bit, you can go from preferring parameters that are exactly zero to preferring parameters that are small.

    If you want to try it out, you can sample some data points from a linear function and add a bit of Gaussian noise. If you then look for a perfect polynomial fit, you will need a pretty complicated function, typically with pretty large weights. However, if you apply regularisation, you will come close to your data-generating function.
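
    Here is one way to run that experiment (a sketch using NumPy and scikit-learn; any similar setup would do, and the exact numbers will vary):

      import numpy as np
      from sklearn.linear_model import LinearRegression, Ridge
      from sklearn.preprocessing import PolynomialFeatures

      rng = np.random.default_rng(0)
      x = np.linspace(-1, 1, 15).reshape(-1, 1)
      y = 2 * x.ravel() + 0.5 + rng.normal(scale=0.1, size=len(x))  # linear data + Gaussian noise

      # degree-12 polynomial features: enough flexibility to (over)fit 15 points
      X_poly = PolynomialFeatures(degree=12, include_bias=False).fit_transform(x)

      plain = LinearRegression().fit(X_poly, y)
      regularised = Ridge(alpha=1.0).fit(X_poly, y)

      print(np.abs(plain.coef_).max())        # unregularised fit: typically very large weights
      print(np.abs(regularised.coef_).max())  # ridge fit: much smaller weights, close to the linear truth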

    But if you want to set your reasoning on rock-solid theoretical foundations, I would recommend applying Bayesian statistics. The idea there is that you define a probability distribution over regression functions. That way you can define for yourself what a "probable" regression function is.
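
    As a pointer in that direction: placing a zero-mean Gaussian prior on the weights and taking the most probable weights (the MAP estimate) gives you exactly the squared-parameter penalty from the question, up to the choice of the penalty strength. scikit-learn's BayesianRidge implements a version of this idea, if you want to play with it (again just a sketch, not something prescribed above):

      import numpy as np
      from sklearn.linear_model import BayesianRidge

      rng = np.random.default_rng(1)
      X = rng.normal(size=(50, 3))
      y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=50)

      model = BayesianRidge().fit(X, y)   # Gaussian prior over the weights
      print(model.coef_)                  # weights pulled towards zero by that prior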

    (Actually, Machine Learning by Tom Mitchell contains a pretty good and more detailed explanation.)