Tags: python, scikit-learn, gradient-descent

The formula for the optimal learning rate in the SGDClassifier in Scikit-learn


I cannot find the formula that the SGDClassifier in Scikit-learn uses for the learning rate when learning_rate='optimal' anywhere in the original C++ source code of the same algorithm (https://leon.bottou.org/projects/sgd).

The formula is mentioned in the docs of SGDClassifier in Scikit-learn:

eta = 1 / (alpha * (t + t_0))

Is this the formula used in the original code, or did it change when it was ported to Scikit-learn? Also, what does t_0 represent exactly (in the docs it's only mentioned that it was determined with a heuristic)?


Solution

  • Let's go through the source code and the formulas.

    Sklearn states the following formula: eta = 1 / (alpha * (t + t_0)). On the website of Leon Bottou we find the expression eta = eta_0 / (1 + lambda * eta_0 * t).

    Let's rewrite the latter formula a bit:

    eta = eta_0 / (1 + lambda * eta_0 * t)
        = 1 / (1/eta_0 + lambda * t)
        = 1 / (lambda * (1/(eta_0 * lambda) + t))
    
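    As a quick numerical sanity check (a minimal sketch with arbitrary values for eta_0 and alpha, where alpha plays the role of lambda), the two schedules coincide once we set t_0 = 1/(eta_0 * alpha):

    import numpy as np

    eta0, alpha = 0.5, 1e-4
    t = np.arange(1, 1000)
    t0 = 1.0 / (eta0 * alpha)

    eta_bottou = eta0 / (1.0 + alpha * eta0 * t)   # Leon Bottou's schedule
    eta_sklearn = 1.0 / (alpha * (t + t0))         # sklearn's 'optimal' schedule

    assert np.allclose(eta_bottou, eta_sklearn)    # identical up to floating-point error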

    If we now identify lambda with alpha and sklearn's t_0 with 1/(eta_0 * alpha), the formulas are the same. Let's now look into the source code: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/sgd_fast.pyx#L657

    In line 657 we see that optimal_init = 1.0 / (initial_eta0 * alpha). The optimal_init variable is just a different name for the t_0 from our formulas, as line 679 confirms: eta = 1.0 / (alpha * (optimal_init + t - 1)).
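
    The surrounding lines of sgd_fast.pyx also show where initial_eta0 itself comes from, which is the heuristic the docs allude to. Below is a sketch paraphrasing those lines as plain Python (the function name eta_optimal is mine, and hinge_dloss is a simplified stand-in for the dloss method of sklearn's internal loss objects):

    import numpy as np

    def eta_optimal(alpha, dloss, n_steps):
        # heuristic: pick a "typical" weight magnitude typw and scale it by
        # the loss derivative at -typw to get the initial learning rate
        typw = np.sqrt(1.0 / np.sqrt(alpha))
        initial_eta0 = typw / max(1.0, dloss(-typw, 1.0))
        # t_0 is chosen so that eta at the first sample equals initial_eta0
        optimal_init = 1.0 / (initial_eta0 * alpha)
        for t in range(1, n_steps + 1):
            yield 1.0 / (alpha * (optimal_init + t - 1))

    # example with the hinge loss: dloss(p, y) = -y if p*y < 1 else 0
    hinge_dloss = lambda p, y: -y if p * y < 1.0 else 0.0
    etas = list(eta_optimal(alpha=1e-4, dloss=hinge_dloss, n_steps=5))
    # etas[0] equals initial_eta0, and the rate decays from there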

    Hence, the formulas are the same. As for t_0: it is just the shift that makes the schedule start at the heuristically chosen initial_eta0, since plugging t = 1 into eta = 1.0 / (alpha * (optimal_init + t - 1)) gives back exactly initial_eta0.
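
    For completeness, a minimal usage sketch (make_classification is just a stand-in dataset): 'optimal' is the default schedule for SGDClassifier, and the eta0 parameter is not used by it, since the initial learning rate comes from the heuristic above.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    # learning_rate='optimal' is the default; eta0 is ignored by this schedule
    clf = SGDClassifier(learning_rate='optimal', alpha=1e-4, random_state=0)
    clf.fit(X, y)
    print(clf.score(X, y))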