I cannot find the formula that SGDClassifier in Scikit-learn uses for the learning rate when learning_rate='optimal' anywhere in the original C++ source code of the SGD project it is based on (https://leon.bottou.org/projects/sgd).

The formula given in the SGDClassifier documentation is eta = 1 / (alpha * (t + t_0)).

Is this the formula used in the original SGD code, or did it change when it was ported to Scikit-learn? Also, what does t_0 represent exactly? The docs only say that it was determined by a heuristic.
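For reference, this is the setting the question is about (a minimal usage sketch; the alpha shown is just scikit-learn's default):

```python
from sklearn.linear_model import SGDClassifier

# Setting in question: with learning_rate='optimal' the docs state
# eta = 1 / (alpha * (t + t_0)), where t_0 is chosen by a heuristic.
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001,
                    learning_rate="optimal")
```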
Let's go through the formulas and the source code.

Sklearn states the following formula: eta = 1 / (alpha * (t + t_0)). On Leon Bottou's website we find the expression eta = eta_0 / (1 + lambda * eta_0 * t).
Let's rewrite the latter formula a bit (in the second step the numerator and denominator are divided by eta_0):

eta = eta_0 / (1 + lambda * eta_0 * t)
    = 1 / (1/eta_0 + lambda * t)
    = 1 / (lambda * (1/(lambda * eta_0) + t))
If now lambda = alpha and the t_0 from sklearn equals 1/(eta_0 * alpha), the two formulas are identical.
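As a quick sanity check (not part of the derivation; the eta_0 and alpha values below are arbitrary), the two expressions can be compared numerically:

```python
# Verify numerically that Bottou's schedule and sklearn's schedule agree
# when lambda = alpha and t_0 = 1 / (eta_0 * alpha). Values are arbitrary.
eta0, alpha = 0.05, 0.0001
t0 = 1.0 / (eta0 * alpha)

for t in (1, 10, 1000, 100000):
    bottou_eta = eta0 / (1.0 + alpha * eta0 * t)   # eta_0 / (1 + lambda*eta_0*t)
    sklearn_eta = 1.0 / (alpha * (t + t0))         # 1 / (alpha * (t + t_0))
    assert abs(bottou_eta - sklearn_eta) < 1e-12 * bottou_eta
```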
Let's now look into the source code: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/sgd_fast.pyx#L657
In line 657 we see that optimal_init = 1.0 / (initial_eta0 * alpha). The optimal_init variable is just a different name for the t_0 from our formulas, as we see in line 679: eta = 1.0 / (alpha * (optimal_init + t - 1)) (the counter t starts at 1 in that loop, so the - 1 simply makes the first update use exactly initial_eta0).
Hence, the formulas are the same.
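Regarding what t_0 represents: in the same file, right before optimal_init is computed, initial_eta0 itself is chosen by a small heuristic (a "typical" weight scale divided by the loss derivative at that scale), and optimal_init = 1 / (initial_eta0 * alpha) is then exactly the shift that makes the first step use initial_eta0. A rough Python paraphrase of those Cython lines, assuming the hinge loss for the dloss term (other losses give a different initial_eta0):

```python
import math

def hinge_dloss(p, y):
    # Derivative of the hinge loss w.r.t. the prediction p:
    # -y inside the margin, 0 otherwise (as in sklearn's Hinge.dloss).
    return -y if p * y <= 1.0 else 0.0

def optimal_eta(alpha, t, dloss=hinge_dloss):
    """Sketch of the learning_rate='optimal' schedule from sgd_fast.pyx.

    Illustration only, not the library's public API.
    """
    # Heuristic initial learning rate: pick a "typical" weight scale typw
    # and divide it by the loss derivative at that scale.
    typw = math.sqrt(1.0 / math.sqrt(alpha))
    initial_eta0 = typw / max(1.0, dloss(-typw, 1.0))

    # t_0 (optimal_init in the source): chosen so that the first update
    # (t = 1) uses exactly initial_eta0.
    optimal_init = 1.0 / (initial_eta0 * alpha)

    return 1.0 / (alpha * (optimal_init + t - 1))
```

With the default alpha = 0.0001 and the hinge assumption above, this gives initial_eta0 = 10 and t_0 = 1000, after which eta decays roughly like 1/t.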