Tags: python, r, machine-learning, regression, glmnet

ElasticNetCV in Python vs cv.glmnet in R


Has anybody managed to reach the same results with ElasticNetCV in Python and cv.glmnet in R? I have figured out how to match ElasticNet in Python with glmnet in R, but I cannot reproduce the match with the cross-validation methods...

Steps to reproduce in Python:

Preprocessing:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd

data = make_regression(
    n_samples=100000,
    random_state=0
)
X, y = data[0], data[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)

pd.DataFrame(X_train).to_csv('X_train.csv', index=None)
pd.DataFrame(X_test).to_csv('X_test.csv', index=None)
pd.DataFrame(y_train).to_csv('y_train.csv', index=None)
pd.DataFrame(y_test).to_csv('y_test.csv', index=None)

Models:

model = ElasticNet(
    alpha=1.0,
    l1_ratio=0.5,
    fit_intercept=True,
    normalize=True,
    precompute=False,
    max_iter=100000,
    copy_X=True,
    tol=0.0000001,
    warm_start=False,
    positive=False,
    random_state=0,
    selection='cyclic'
)

model.fit(
    X=X_train,
    y=y_train
)

y_pred = model.predict(
    X=X_test
)

print(
    mean_squared_error(
        y_true=y_test,
        y_pred=y_pred
    )
)

output: 42399.49815189786

model = ElasticNetCV(
    l1_ratio=0.5,
    eps=0.001,
    n_alphas=100,
    alphas=None,
    fit_intercept=True,
    normalize=True,
    precompute=False,
    max_iter=100000,
    tol=0.0000001,
    cv=10,
    copy_X=True,
    verbose=0,
    n_jobs=-1,
    positive=False,
    random_state=0,
    selection='cyclic'
)

model.fit(
    X=X_train,
    y=y_train
)

y_pred = model.predict(
    X=X_test
)

print(
    mean_squared_error(
        y_true=y_test,
        y_pred=y_pred
    )
)

output: 39354.729173913176
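
For reference, here is how the alpha picked by the cross-validation can be inspected (these are standard fitted attributes of ElasticNetCV; the values depend on the run):

print(model.alpha_)     # regularization strength selected by 10-fold CV
print(model.l1_ratio_)  # the l1_ratio that was used (0.5, as set above)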

Steps to reproduce in R:

Preprocessing:

library(glmnet)
X_train <- read.csv('X_train.csv')
X_test <- read.csv('X_test.csv')
y_train <- read.csv('y_train.csv')
y_test <- read.csv('y_test.csv')
fit <- glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)

output: 42399.5

fit <- cv.glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)

output: 37.00207


Solution

  • Thanks so much for providing the example. I am on a laptop, so I had to reduce the number of samples to 100:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNet, ElasticNetCV
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    import pandas as pd
    
    data = make_regression(
        n_samples=100,
        random_state=0
    )
    X, y = data[0], data[1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
    

    When you call predict on a glmnet fit, you need to specify the lambda (the s argument); otherwise it returns the predictions for every lambda along the path. So in R:

    fit <- glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
    y_pred <- predict(fit, newx = as.matrix(X_test))
    dim(y_pred)
    [1] 25 89
    

    When you run cv.glmnet, it selects the best lambda from cross-validation (predict defaults to lambda.1se), so it gives you a single column of predictions, which is the MSE you wanted (a Python sketch of mimicking this rule follows the snippet below):

    fit <- cv.glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
    y_pred <- predict(fit, newx = as.matrix(X_test))
    y_error <- y_test - y_pred
    mean(as.matrix(y_error)^2)
    [1] 22.03504
    
    dim(y_error)
    [1] 25  1
    fit$lambda.1se
    [1] 1.278699
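
    ElasticNetCV has no built-in equivalent of lambda.1se; it simply picks the alpha with the lowest mean CV error. If you wanted the analogous one-standard-error choice in Python, a rough sketch from the CV error path (assuming a fitted ElasticNetCV called model with a single, scalar l1_ratio) could look like:

    import numpy as np

    # Sketch of glmnet's lambda.1se rule on top of ElasticNetCV results.
    mse = model.mse_path_.reshape(len(model.alphas_), -1)  # (n_alphas, n_folds)
    mean_mse = mse.mean(axis=1)
    se = mse.std(axis=1, ddof=1) / np.sqrt(mse.shape[1])   # standard error per alpha
    best = mean_mse.argmin()
    # glmnet's rule: the largest alpha whose mean CV error is within one SE of the minimum
    alpha_1se = model.alphas_[mean_mse <= mean_mse[best] + se[best]].max()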
    

    If we select the lambda in the glmnet fit closest to the one chosen by cv.glmnet, we get back something in the correct range (a Python analogue of this path trick follows below):

    fit <- glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
    sel <- which.min(abs(fit$lambda - 1.278699))
    y_pred <- predict(fit, newx = as.matrix(X_test))[, sel]
    mean(as.matrix((y_test - y_pred)^2))
    [1] 20.0775
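
    For reference, scikit-learn exposes the same full-path behaviour through enet_path / lasso_path, so the "nearest alpha on the path" trick can be sketched in Python too (hypothetical, just mirroring the R snippet above):

    import numpy as np
    from sklearn.linear_model import enet_path

    # Full regularization path, analogous to a bare glmnet fit:
    # coefs holds one column of coefficients per alpha along the path.
    alphas, coefs, _ = enet_path(X_train, y_train, l1_ratio=1.0)
    sel = np.argmin(np.abs(alphas - 1.278699))  # alpha closest to R's lambda.1se
    print(alphas[sel], coefs[:, sel].shape)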
    

    Before we compare with sklearn, we need to make sure we are testing over the same range of lambdas.

    L <- c(0.01, 0.05, 0.1, 0.2, 0.5, 1, 2)
    fit <- cv.glmnet(x=as.matrix(X_train), y=as.matrix(y_train),lambda=L)
    y_pred <- predict(fit, newx = as.matrix(X_test))
    y_error <- y_test - y_pred
    mean(as.matrix(y_error)^2)
    [1] 0.003065869
    

    So we expect something in the range of 0.003065869. We run sklearn with the same lambdas; note that glmnet's lambda is what ElasticNet calls alpha, while glmnet's alpha is in fact your l1_ratio (see the glmnet vignette). And the normalize option should be left at False, because, per the sklearn docs:

    If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.
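
    If you did want the standardization, the sklearn-recommended route is a pipeline; a minimal sketch, reusing the training data from above:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Standardize explicitly instead of relying on normalize=True.
    pipe = make_pipeline(
        StandardScaler(),
        ElasticNetCV(l1_ratio=1, alphas=[0.01, 0.05, 0.1, 0.2, 0.5, 1, 2])
    )
    pipe.fit(X_train, y_train)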

    So we just run it using CV:

    model = ElasticNetCV(l1_ratio=1, fit_intercept=True, alphas=[0.01, 0.05, 0.1, 0.2, 0.5, 1, 2])
    model.fit(X=X_train,y=y_train)
    y_pred = model.predict(X=X_test)
    mean_squared_error(y_true=y_test,y_pred=y_pred)
    
    0.0018007824874741929
    

    It's in the same ballpark as the R result.

    And if you do the same with plain ElasticNet, you will get the same result, provided you specify the alpha explicitly.
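
    For completeness, a small sketch of that last point, assuming the fitted ElasticNetCV from above is still in scope:

    from sklearn.linear_model import ElasticNet
    from sklearn.metrics import mean_squared_error

    # Refit a plain ElasticNet at the alpha the CV search selected;
    # this should reproduce the ElasticNetCV test error.
    refit = ElasticNet(alpha=model.alpha_, l1_ratio=1, fit_intercept=True)
    refit.fit(X=X_train, y=y_train)
    print(mean_squared_error(y_true=y_test, y_pred=refit.predict(X=X_test)))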