Tags: python, r, r-caret, rpy2, glmnet

Using rpy2 with caret attempts classification instead of regression


I have data, created and preprocessed in Python, that I would like to import into R and fit with a k-fold cross-validated LASSO using glmnet. I want control over which observations are used in each fold, so I want to use caret to do this.

However, I have found that caret interprets my data as a classification instead of a regression problem, and promptly fails. Here is what I hope is a reproducible example:

import numpy as np
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects import numpy2ri
from rpy2.robjects.conversion import localconverter

pandas2ri.activate()
numpy2ri.activate()

# Import essential R packages
glmnet = importr('glmnet')
caret = importr('caret')
base = importr('base')

# Define X and y input 
dummy_x = pd.DataFrame(np.random.rand(10000, 5), columns=('a', 'b', 'c', 'd', 'e'))
dummy_y = np.random.rand(10000)

# Convert pandas DataFrame to R data.frame
with localconverter(robjects.default_converter + pandas2ri.converter):
    dummy_x_R = robjects.conversion.py2rpy(dummy_x)

# Use caret to perform the fit using default settings 
caret_test = caret.train(**{'x': dummy_x_R, 'y': dummy_y, 'method': 'glmnet'})

rpy2 fails, giving this cryptic error message from R:

RRuntimeError: Error: Metric RMSE not applicable for classification models

What could be causing this? According to this previous question, it may be the case that caret is assuming that at least one of my variables is an integer type, and so defaults to thinking this is a classification instead of a regression problem.

However, I have checked both X and y using typeof, and they are clearly doubles:

base.sapply(dummy_x_R, 'typeof')
>>> array(['double', 'double', 'double', 'double', 'double'], dtype='<U6')

base.sapply(dummy_y, 'typeof')
>>> array(['double', 'double', 'double', ..., 'double', 'double', 'double'],
      dtype='<U6')

Why am I getting this error? The default settings of train imply a regression model (hence the default RMSE metric), so why does caret assume a classification model when used in this way?


Solution

  • In situations like this, the first step is to identify whether the unexpected outcome originates on the Python/rpy2 side or on the R side.

    The conversion from pandas to R, or from numpy to R, appears to work as expected, at least for array types:

    >>> [x.typeof for x in dummy_x_R]                                                         
    [<RTYPES.REALSXP: 14>,
     <RTYPES.REALSXP: 14>,
     <RTYPES.REALSXP: 14>,
     <RTYPES.REALSXP: 14>,
     <RTYPES.REALSXP: 14>]
    

    I am guessing that you did something like this for dummy_y:

    >>> from rpy2.robjects import numpy2ri                                               
    >>> with localconverter(robjects.default_converter + numpy2ri.converter):  
            dummy_y_R = robjects.conversion.py2rpy(dummy_y)
    >>> dummy_y_R.typeof                                                                 
    <RTYPES.REALSXP: 14>
    

    However, a rather subtle conversion detail is at the root of the issue: dummy_y_R has a "shape" (the dim attribute in R), while caret expects a shape-less R array (a "vector" in R lingo) in order to perform a regression.
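
    A quick way to see the difference from Python is to ask R for the dim attribute of each object, using the base package imported earlier; the numpy-converted object should report a dim, while a plain vector should report NULL:

    print(base.dim(dummy_y_R))                       # dim is set: an R array with a shape
    print(base.dim(robjects.FloatVector(dummy_y)))   # NULL: a plain, shape-less R vector

    One can force dummy_y to be an R vector with: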

    caret_test = caret.train(**{'x': dummy_x_R,
                                'y': robjects.FloatVector(dummy_y),
                                'method': 'glmnet'})
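
    With y passed as a plain R vector, train treats the problem as regression and the RMSE error goes away.

    Since the original goal was control over which observations land in each fold, something along these lines should give that control once y is a proper vector (a sketch: createFolds generates the index list here, but any list of integer vectors would do):

    # Sketch: build per-fold training indices explicitly (R rows are 1-based)
    # and hand them to trainControl via its `index` argument.
    dummy_y_vec = robjects.FloatVector(dummy_y)
    folds = caret.createFolds(dummy_y_vec, k=5, returnTrain=True)
    ctrl = caret.trainControl(method='cv', index=folds)
    caret_test = caret.train(**{'x': dummy_x_R,
                                'y': dummy_y_vec,
                                'method': 'glmnet',
                                'trControl': ctrl})

    For fully manual fold assignment, a list of integer index vectors built with robjects.IntVector can replace the createFolds output.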