Consider the regularized cost function in machine learning, J(θ) = sum((h_θ - y)²) + λ * sum(θ²):
Why do the parameters θ tend towards zero when we set the parameter λ to be very large?
The regularized cost function penalizes the size of the parameters θ, and the regularization term dominates the cost as λ → +inf.
It is worth noting that when λ is very large, most of the cost comes from the regularization term λ * sum(θ²) and not from the actual fit term sum((h_θ - y)²). Hence, in that case, minimizing the cost is mostly about minimizing the regularization term λ * sum(θ²), which is done by pushing θ towards 0 (θ → 0).
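A minimal sketch of this, assuming a linear hypothesis h_θ(x) = θᵀx and made-up toy numbers, just to show how the penalty term takes over the total cost as λ grows:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty: sum((h_theta - y)^2) + lam * sum(theta^2)."""
    residuals = X @ theta - y          # h_theta(x) - y for a linear hypothesis
    fit_term = np.sum(residuals ** 2)  # the "actual" cost
    reg_term = lam * np.sum(theta ** 2)
    return fit_term + reg_term, fit_term, reg_term

# Toy data (hypothetical values chosen only for illustration)
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
y = np.array([4.0, 6.0, 10.0])
theta = np.array([0.5, 1.8])

for lam in [0.0, 1.0, 1e6]:
    total, fit, reg = regularized_cost(theta, X, y, lam)
    print(f"lambda={lam:>9}: fit={fit:.2f}  reg={reg:.2f}  total={total:.2f}")
```

For λ = 1e6 the printed total is essentially the regularization term alone, which is the dominance described above.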
Why does minimizing λ * sum(θ²) result in θ → 0?
Consider the regularization term λ * sum(θ²). To minimize this term, the only option is to push sum(θ²) → 0 (λ is a positive constant, and the sum term is non-negative). And since the θ terms are squared (θ² is always non-negative), the only way to do that is to push every θ parameter towards 0. Hence sum(θ²) → 0 means θ → 0.
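A tiny sketch of that argument, assuming we minimize only the penalty term by gradient descent (the step size and starting θ are made up): the gradient of λ * sum(θ²) is 2λθ, so each update multiplies θ by a factor below 1 and every component shrinks towards 0.

```python
import numpy as np

lam, lr = 10.0, 0.01
theta = np.array([5.0, -3.0, 0.7])
for step in range(200):
    theta = theta - lr * (2 * lam * theta)  # gradient of lam * sum(theta^2) is 2 * lam * theta
print(theta)  # all components end up very close to 0
```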
So to sum up, in this case of very large λ: minimizing the cost function is mostly about minimizing λ * sum(θ²), which requires minimizing sum(θ²), which requires θ → 0.
Some intuition to answer the question in the comment:
Think of λ as a knob that tells the model how much regularization you want. At one extreme, if you set λ to 0, your cost function is not regularized at all; the smaller you set λ, the less regularization you get.
And vice versa: the more you increase λ, the more you are asking your cost function to be regularized, so the smaller the parameters θ will have to be in order to minimize the regularized cost function.
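A hedged illustration of the knob, assuming the linear-regression form of the cost above (for which the minimizer has the closed form (XᵀX + λI)⁻¹Xᵀy) and synthetic data: the fitted θ visibly shrinks towards 0 as λ increases.

```python
import numpy as np

# Synthetic data whose true coefficients are [2, -3, 1.5]
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([2.0, -3.0, 1.5]) + rng.normal(scale=0.1, size=20)

# Closed-form minimizer of sum((X @ theta - y)^2) + lam * sum(theta^2)
for lam in [0.0, 1.0, 100.0, 1e6]:
    theta_star = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(f"lambda={lam:>9}: theta={np.round(theta_star, 4)}")
```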
Why do we use θ² in the regularization sum rather than θ?
Because the goal is to keep θ small (less prone to overfitting). If the regularization term used θ instead of θ² in the sum, you could end up with large θ values that cancel each other out, e.g. θ_1 = 1000000 and θ_2 = -1000001: sum(θ) here is -1, which is small, whereas sum(|θ|) (absolute value) or sum(θ²) (squared) would give a very large value.
In that case you may end up overfitting because of large θ values that escaped the regularization, since the terms cancel each other out.
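A quick numeric check of the example above, showing how a plain sum lets the two large parameters cancel while sum(|θ|) and sum(θ²) do not:

```python
import numpy as np

theta = np.array([1_000_000.0, -1_000_001.0])

print(np.sum(theta))          # -1.0       -> a plain sum lets huge values cancel out
print(np.sum(np.abs(theta)))  # 2000001.0  -> absolute-value penalty catches them
print(np.sum(theta ** 2))     # ~2e12      -> squared penalty catches them even more strongly
```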