Tags: machine-learning, mathematical-optimization

In regularization, why do we use θ² rather than θ?


The regularization term is lambda * sum(θ²).


Solution

  • I've already answered this in your previous question (see last paragraph), but I'll try again.

    The problem with regularizing with sum(θ) is that θ parameters may cancel each other out.

    Example:

    θ_1 = +1000000
    θ_2 = -1000001
    

    Here sum(θ) is +1000000 - 1000001 = -1, which is small.

    But sum(θ²) is 1000000² + (-1000001)² ≈ 2·10¹², which is very big.

    So if you use sum(θ), you may end up with effectively no regularization at all: large θ values can escape the penalty entirely whenever their terms cancel each other out, which defeats the whole point.

    You may instead use sum(|θ|) (the L1 norm), depending on your search/optimisation algorithm, but sum(θ²) (the squared L2 norm) is popular and works well with gradient descent because it is differentiable everywhere. Both penalties are compared in the sketches below.
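
    As a quick check, here is a minimal Python sketch (assuming NumPy is installed) that computes all three candidate penalties on the cancelling parameters from the example above:

    import numpy as np
    
    # The two cancelling parameters from the example above.
    theta = np.array([1_000_000.0, -1_000_001.0])
    
    print(theta.sum())          # sum(θ)   = -1.0       -> huge weights slip through
    print(np.abs(theta).sum())  # sum(|θ|) = 2000001.0  -> L1 penalty catches them
    print((theta ** 2).sum())   # sum(θ²)  ≈ 2e12       -> squared L2 catches them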
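
    And here is a sketch of why the squared penalty plays nicely with gradient descent: its derivative is simply 2·lambda·θ, so each update shrinks every weight toward zero regardless of its sign (often called "weight decay"). The function name and hyperparameter values below are illustrative placeholders, not from the question:

    import numpy as np
    
    def sgd_step(theta, grad_loss, lam=0.1, lr=0.01):
        """One gradient-descent step on loss + lam * sum(theta**2)."""
        grad_penalty = 2.0 * lam * theta  # derivative of lam * theta**2
        return theta - lr * (grad_loss + grad_penalty)
    
    theta = np.array([1_000_000.0, -1_000_001.0])
    # Even with a zero loss gradient, the penalty alone pulls both
    # weights toward zero, whatever their sign.
    theta = sgd_step(theta, grad_loss=np.zeros_like(theta))
    print(theta)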