Tags: optimization, machine-learning, artificial-intelligence

Regularized cost function with very large λ


Consider the cost function with regularization in machine learning:

J(θ) = (1/(2m)) * [ sum((h_θ(x) - y)²) + λ * sum(θ²) ]

Why do the parameters θ tend towards zero when we set the parameter λ to be very large?


Solution

  • The regularized cost function includes a penalty on the size of the parameters θ.

    The regularization term dominates the cost as λ → +∞.

    It is worth noting that when λ is very large, most of the cost comes from the regularization term λ * sum(θ²) rather than from the actual fit cost sum((h_θ - y)²). In that case, minimizing the cost function is mostly about minimizing the regularization term λ * sum(θ²), which pushes θ towards 0 (θ → 0).
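    To make the two components concrete, here is a minimal sketch of how this cost could be computed. It assumes a linear hypothesis h_θ(x) = θᵀx and NumPy; both are illustrative assumptions, not part of the original question:

    ```python
    import numpy as np

    def regularized_cost(theta, X, y, lam):
        """J(theta) = (1/(2m)) * [ sum((h_theta - y)^2) + lam * sum(theta^2) ].

        Assumes a linear hypothesis h_theta(x) = X @ theta.
        (Conventionally the bias theta_0 is excluded from the penalty sum;
        that detail is omitted here for simplicity.)
        """
        m = len(y)
        residuals = X @ theta - y              # h_theta(x) - y for each example
        data_cost = np.sum(residuals ** 2)     # the actual fit cost
        reg_cost = lam * np.sum(theta ** 2)    # the regularization term
        return (data_cost + reg_cost) / (2 * m)
    ```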

    Why minimizing λ * sum(θ²) results in θ → 0

    Consider the regularization term λ * sum(θ²). Since λ is a positive constant and the sum of squares is non-negative, the only way to minimize this term is to push sum(θ²) → 0.

    And since the θ terms are squared (θ² is never negative), the only way to do that is to push each parameter θ towards 0. Hence sum(θ²) → 0 means θ → 0.

    So to sum up, in this case of very large λ:

    Minimizing the cost function is mostly about minimizing λ * sum(θ²), which requires minimizing sum(θ²), which requires θ → 0.
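    A quick numeric sketch of this effect, using the closed-form minimizer θ = (XᵀX + λI)⁻¹ Xᵀy of sum((Xθ - y)²) + λ * sum(θ²) on made-up data (the data and the use of NumPy are illustrative assumptions):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))              # made-up design matrix
    y = X @ np.array([2.0, -3.0, 1.5]) + rng.normal(scale=0.1, size=100)

    for lam in [0.0, 1.0, 100.0, 1e6]:
        # (X^T X + lam*I) theta = X^T y  <=>  minimizer of the regularized cost
        theta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
        print(f"lambda = {lam:>9g}  ->  theta = {np.round(theta, 4)}")
    ```

    With λ = 0 this recovers the plain least-squares fit (θ ≈ [2, -3, 1.5]); as λ grows, θ shrinks towards the zero vector.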

    Some intuition to answer the question in the comment:

    Think of λ as a parameter that tells how much regularization you want. E.g. if, at one extreme, you set λ to 0, then your cost function is not regularized at all. If you set λ to a small number, you get less regularization.

    And vice versa: the more you increase λ, the more you're asking your cost function to be regularized, so the smaller the parameters θ will have to be in order to minimize the regularized cost function.

    Why do we use θ² in the regularization sum rather than θ?

    Because the goal is to keep θ small (less prone to overfitting). If the regularization term used θ instead of θ² in the sum, large θ values could cancel each other out: e.g. with θ_1 = 1000000 and θ_2 = -1000001, sum(θ) is -1, which is small, whereas sum(|θ|) (absolute value) or sum(θ²) (squared) would be very large.

    In that case you may end up overfitting, because large θ values escape the regularization when the terms cancel each other out.
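    A tiny sketch of that cancellation, using the illustrative θ values above:

    ```python
    import numpy as np

    theta = np.array([1000000.0, -1000001.0])

    print(np.sum(theta))          # -1.0: huge values cancel, the penalty looks tiny
    print(np.sum(np.abs(theta)))  # 2000001.0: the absolute-value sum sees the magnitude
    print(np.sum(theta ** 2))     # ~2.000002e+12: the squared sum penalizes it heavily
    ```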