I have data that I have created and preprocessed in Python that I would like to import into R to perform a k-fold cross-validated LASSO fit using glmnet. I want control over which observations are used in each fold, so I want to use caret to do this.
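(For context, the fold control I am after would eventually go through caret's trainControl(index=...) argument, roughly as in the sketch below; the fold indices are made up and the variable names refer to the example further down.)
# Rough sketch of the intended fold control (not part of the failing example below).
# trainControl's 'index' takes a named list of 1-based row indices used for
# training in each resampling iteration.
custom_folds = robjects.ListVector({
    'Fold1': robjects.IntVector(range(1, 5001)),
    'Fold2': robjects.IntVector(range(5001, 10001)),
})
ctrl = caret.trainControl(**{'method': 'cv', 'index': custom_folds})
# ...which would then be passed to caret.train(..., trControl=ctrl)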
However, I have found that caret interprets my data as a classification problem instead of a regression problem, and promptly fails. Here is what I hope is a reproducible example:
import numpy as np
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects import numpy2ri
from rpy2.robjects.conversion import localconverter
# Activate automatic pandas/numpy <-> R conversion
pandas2ri.activate()
numpy2ri.activate()
# Import essential R packages
glmnet = importr('glmnet')
caret = importr('caret')
base = importr('base')
# Define X and y input
dummy_x = pd.DataFrame(np.random.rand(10000, 5), columns=('a', 'b', 'c', 'd', 'e'))
dummy_y = np.random.rand(10000)
# Convert pandas DataFrame to R data.frame
with localconverter(robjects.default_converter + pandas2ri.converter):
    dummy_x_R = robjects.conversion.py2rpy(dummy_x)
# Use caret to perform the fit using default settings
caret_test = caret.train(**{'x': dummy_x_R, 'y': dummy_y, 'method': 'glmnet'})
rpy2 fails, giving this cryptic error message from R:
RRuntimeError: Error: Metric RMSE not applicable for classification models
What could be causing this? According to this previous question, it may be the case that caret is assuming that at least one of my variables is an integer type, and so defaults to thinking this is a classification instead of a regression problem.
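(That hypothesis is easy to test directly; here is a small sketch that asks R whether anything is stored as integer, using the same sapply pattern as below:)
# Sketch: check whether any column of X, or y itself, is stored as an integer
base.sapply(dummy_x_R, 'is.integer')   # expected: all FALSE for the data above
robjects.r['is.integer'](dummy_y)      # expected: FALSE as well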
However, I have checked both X and y using typeof, and they are clearly doubles:
base.sapply(dummy_x_R, 'typeof')
>>> array(['double', 'double', 'double', 'double', 'double'], dtype='<U6')
base.sapply(dummy_y, 'typeof')
>>> array(['double', 'double', 'double', ..., 'double', 'double', 'double'],
dtype='<U6')
Why am I getting this error? The default settings of train assume a regression model, so why does caret assume a classification model when used in this way?
In situations like this, the first step is to identify whether the unexpected outcome originates on the Python/rpy2 side or on the R side.
The conversion from pandas to R, or from numpy to R, appears to work as expected, at least for the array types:
>>> [x.typeof for x in dummy_x_R]
[<RTYPES.REALSXP: 14>,
<RTYPES.REALSXP: 14>,
<RTYPES.REALSXP: 14>,
<RTYPES.REALSXP: 14>,
<RTYPES.REALSXP: 14>]
I am guessing that this is what you might have done for dummy_y:
>>> from rpy2.robjects import numpy2ri
>>> with localconverter(robjects.default_converter + numpy2ri.converter):
...     dummy_y_R = robjects.conversion.py2rpy(dummy_y)
>>> dummy_y_R.typeof
<RTYPES.REALSXP: 14>
However, a rather subtle conversion detail is at the root of the issue: dummy_y_R has a "shape" (attribute dim in R), while caret expects a shape-less R array (a "vector" in R lingo) in order to perform a regression. One can force dummy_y to be an R vector with:
caret_test = caret.train(**{'x': dummy_x_R,
                            'y': robjects.FloatVector(dummy_y),
                            'method': 'glmnet'})
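(If you want to see the difference for yourself, a quick check is to compare the dim attribute of the two objects; for the numpy-converted array it should be a length-one integer vector, while for the FloatVector it should be NULL. A sketch, assuming the dummy_y_R created above:)
# The object produced by the numpy2ri conversion carries a dim attribute (a 1-d R array)...
robjects.r['dim'](dummy_y_R)
# ...while a FloatVector is a plain R vector with no dim, which is what caret expects
robjects.r['dim'](robjects.FloatVector(dummy_y))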