Tags: python, scikit-learn, statsmodels

Mismatch between statsmodels and sklearn ridge regression


I'm exploring ridge regression. While comparing statsmodels and sklearn, I found that the two libraries produce different output for ridge regression. Below is a simple example of the difference:

import numpy as np
import pandas as pd 
import statsmodels.api as sm
from sklearn.linear_model import Lasso, Ridge

np.random.seed(142131)

n = 500
d = pd.DataFrame()
d['A'] = np.random.normal(size=n)
d['B'] = d['A'] + np.random.normal(scale=0.25, size=n)
d['C'] = np.random.normal(size=n)
d['D'] = np.random.normal(size=n)
d['intercept'] = 1
d['Y'] = 5 - 2*d['A'] + 1*d['D'] + np.random.normal(size=n)

y = np.asarray(d['Y'])
X = np.asarray(d[['intercept', 'A', 'B', 'C', 'D']])

First, using sklearn and Ridge:

ridge = Ridge(alpha=1, fit_intercept=True)
ridge.fit(X=np.asarray(d[['A', 'B', 'C', 'D']]), y=y)
ridge.intercept_, ridge.coef_

which outputs [4.99721, -2.00968, 0.03363, -0.02145, 1.02895].

Next, statsmodels and OLS.fit_regularized:

penalty = np.array([0, 1., 1., 1., 1.])
ols = sm.OLS(y, X).fit_regularized(L1_wt=0., alpha=penalty)
ols.params

which outputs [5.01623, -0.69164, -0.63901, 0.00156, 0.55158]. However, since both implement ridge regression, I would expect the estimates to be the same.

Note that neither of these penalizes the intercept term (I already checked that as a possible source of the difference). I also don't think this is an error on my part: both implementations provide the same output for the LASSO. Below is a demonstration with the previous data:

# sklearn LASSO
lasso = Lasso(alpha=0.5, fit_intercept=True)
lasso.fit(X=np.asarray(d[['A', 'B', 'C', 'D']]), y=y)
lasso.intercept_, lasso.coef_

# statsmodels LASSO
penalty = np.array([0, 0.5, 0.5, 0.5, 0.5])
ols = sm.OLS(y, X).fit_regularized(L1_wt=1., alpha=penalty)
ols.params

which both output [5.01465, -1.51832, 0., 0., 0.57799].

So my question is why do the estimated coefficients for ridge regression differ across implementations in sklearn and statsmodels?


Solution

  • After digging around a little more, I discovered why the two differ. Relative to statsmodels' parameterization, sklearn's Ridge effectively scales the penalty term as alpha / n, where n is the number of observations; statsmodels does not apply this scaling to the tuning parameter. You can make the ridge implementations match if you re-scale the penalty for statsmodels.

    Using my posted example, here is how you would have the output match between the two:

    # sklearn 
    # NOTE: there is no difference from above
    ridge = Ridge(alpha=1, fit_intercept=True)
    ridge.fit(X=np.asarray(d[['A', 'B', 'C', 'D']]), y=y)
    ridge.intercept_, ridge.coef_
    
    # statsmodels
    # NOTE: going to re-scale the penalties based on n observations
    n = X.shape[0]
    penalty = np.array([0, 1., 1., 1., 1.]) / n  # scaling penalties
    ols = sm.OLS(y, X).fit_regularized(L1_wt=0., alpha=penalty)
    ols.params
    

    Now both output [4.99721, -2.00968, 0.03363, -0.02145, 1.02895].
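    To see why dividing by n works, here is a sketch of the two objective functions as I read them from the respective docs (the formulas are my summary, not code from either library). Multiplying the statsmodels objective, evaluated at alpha / n, by 2n recovers sklearn's objective exactly, so the two problems share the same minimizer:

    # Objectives as I understand them (slopes penalized, intercept not):
    #   sklearn Ridge:          RSS + alpha * ||slopes||^2
    #   statsmodels (L1_wt=0):  0.5 * RSS / n + 0.5 * alpha * ||slopes||^2
    def sklearn_objective(w, X, y, alpha):
        resid = y - X @ w
        return resid @ resid + alpha * (w[1:] @ w[1:])  # w[0] is the intercept

    def statsmodels_objective(w, X, y, alpha):
        n_obs = X.shape[0]
        resid = y - X @ w
        return 0.5 * (resid @ resid) / n_obs + 0.5 * alpha * (w[1:] @ w[1:])

    w = np.asarray(ols.params)  # the matched solution from above
    lhs = 2 * n * statsmodels_objective(w, X, y, 1.0 / n)
    rhs = sklearn_objective(w, X, y, 1.0)
    print(np.isclose(lhs, rhs))  # True: same objective up to a constant factor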

    I am posting this so that anyone else who runs into this situation can find the answer more easily (I haven't seen this difference discussed anywhere before). I'm not sure of the rationale for the re-scaling, and it is odd to me that Ridge re-scales the tuning parameter while Lasso does not, but it is important behavior to be aware of. Reading the sklearn documentation for Ridge and LASSO, I did not see this difference in scaling behavior mentioned.
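
    For convenience, here is a small hypothetical helper (the name is my own) that builds the statsmodels penalty vector matching a given sklearn Ridge alpha, assuming the design matrix carries an explicit intercept in its first column:

    def ridge_penalty_for_statsmodels(alpha, n_obs, n_params):
        """Penalty vector mirroring sklearn's Ridge(alpha=...)."""
        pen = np.full(n_params, alpha / n_obs)
        pen[0] = 0.0  # leave the intercept unpenalized (fit_intercept=True)
        return pen

    # e.g., this reproduces the matching fit above
    penalty = ridge_penalty_for_statsmodels(1.0, X.shape[0], X.shape[1])
    ols = sm.OLS(y, X).fit_regularized(L1_wt=0., alpha=penalty)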