Tags: python-3.x, scikit-learn, cross-validation

Python 3 and sklearn: difficulty using a non-sklearn model as a sklearn model


The code below works. It is just a routine that runs a cross-validation scheme using a linear model previously defined in sklearn, and I have no problem with that part. My problem is this: if I replace model=linear_model.LinearRegression() with model=RBF('multiquadric') (see the two model-definition lines in __main__ below), it no longer works. So my problem is actually in the class RBF, where I try to mimic a sklearn model.

With that replacement in place, I get the following error:

/home/daniel/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: All arrays must be equal length.

  FitFailedWarning)

1) Should I define a score function in the class RBF?

2) How do I do that? I am lost. Since I inherit from BaseEstimator and RegressorMixin, I expected this to be handled internally.

3) Is there something else missing?
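
For reference, my understanding is that RegressorMixin already supplies a default score method (R² computed from predict), so the scorer itself may not be the missing piece; what sklearn does expect is that fit returns self. A minimal sketch of a conforming regressor (the MeanRegressor below is made up purely to show the shape of the API):

    import numpy as np
    from sklearn.base import BaseEstimator, RegressorMixin

    class MeanRegressor(BaseEstimator, RegressorMixin):
        # Illustrative only: always predicts the mean of the training targets.
        def fit(self, X, y):
            self.mean_ = np.mean(y)   # learned attributes get a trailing underscore
            return self               # sklearn convention: fit returns self
        def predict(self, X):
            return np.full(len(X), self.mean_)

    # RegressorMixin provides score() for free, computed as R² from predict():
    # MeanRegressor().fit(X, y).score(X, y)  ->  0.0 on the training data

My full code follows.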

from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import Rbf
np.random.seed(0)
from sklearn.base import BaseEstimator, RegressorMixin



class RBF(BaseEstimator, RegressorMixin):
    def __init__(self,function):
        self.function=function
    def fit(self,x,y):
        self.rbf = Rbf(x, y,function=self.function)
    def predict(self,x):   
        return self.rbf(x)    


if __name__ == "__main__":
    # Load Data
    targetName='HousePrice'
    data=datasets.load_boston()
    featuresNames=list(data.feature_names)
    featuresData=data.data
    targetData = data.target
    df=pd.DataFrame(featuresData,columns=featuresNames)
    df[targetName]=targetData
    independent_variable_list=featuresNames
    dependent_variable=targetName
    X=df[independent_variable_list].values
    y=np.squeeze(df[[dependent_variable]].values)    
    # Model Definition    
    model=linear_model.LinearRegression()
    #model=RBF('multiquadric')    
    # Cross validation routine
    number_splits=5
    score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
    kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
    scalar = StandardScaler()
    pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
    results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
    for score in score_list:
        print(score+':')
        print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
        print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))

Solution

  • Let's look at the documentation for scipy.interpolate.Rbf:

    *args : arrays

    x, y, z, …, d, where x, y, z, … are the coordinates of the nodes and d is the array of values at the nodes

    So Rbf takes a variable-length argument list in which the last positional argument is the array of values, which is y in your case. The k-th positional argument holds the k-th coordinate of all the data points (and likewise for every other coordinate argument x, y, z, …).
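
    A quick sketch of that calling convention on a toy two-feature dataset (the arrays here are made up for illustration):

        import numpy as np
        from scipy.interpolate import Rbf

        X = np.random.rand(10, 2)   # 10 samples, 2 coordinate axes
        y = np.random.rand(10)

        # one positional array per coordinate axis, values array last
        rbf = Rbf(X[:, 0], X[:, 1], y)
        # the same call without spelling out every column:
        rbf = Rbf(*X.T, y)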

    Following the documentation, your code should be

    from sklearn import datasets
    import numpy as np
    import pandas as pd
    from sklearn import linear_model
    from sklearn import model_selection
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from scipy.interpolate import Rbf
    np.random.seed(0)
    from sklearn.base import BaseEstimator, RegressorMixin
    
    class RBF(BaseEstimator, RegressorMixin):
        def __init__(self, function):
            self.function = function

        def fit(self, X, y):
            # Rbf expects one 1-D array per coordinate axis followed by the
            # values array: Rbf(x0, x1, ..., y). *X.T unpacks the feature
            # columns into exactly that form.
            self.rbf = Rbf(*X.T, y, function=self.function)
            return self  # sklearn convention: fit returns the estimator

        def predict(self, X):
            return self.rbf(*X.T)
    
    
    # Load Data
    data=datasets.load_boston()
    
    X = data.data
    y = data.target
    
    
    number_splits=5
    score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
    
    kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
    scalar = StandardScaler()
    
    model = RBF(function='multiquadric')
    
    pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
    
    results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
    
    for score in score_list:
        print(score+':')
        print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
        print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))
    

    Output

    neg_mean_squared_error:
    Train: Mean -1.552450953914355e-20 Standard Error 7.932530906290208e-21
    Test: Mean -23.007377210596463 Standard Error 4.254629143836107
    neg_mean_absolute_error:
    Train: Mean -9.398502208736061e-11 Standard Error 2.4673749061941226e-11
    Test: Mean -3.1319779583728673 Standard Error 0.2162343985534446
    r2:
    Train: Mean 1.0 Standard Error 0.0
    Test: Mean 0.7144217179633185 Standard Error 0.08526294242760363
    

    Why *X.T: as we saw, each positional argument corresponds to one coordinate axis of all the data points. X has shape (n_samples, n_features), so X.T has one row per feature axis, and the * operator expands those rows so that each one is passed as a separate argument to the variable-length function.
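
    Concretely, a tiny throwaway illustration of the transpose-and-unpack step:

        import numpy as np

        X = np.array([[1., 10.],
                      [2., 20.],
                      [3., 30.]])   # 3 samples, 2 features
        print(X.T[0])               # first coordinate of every sample: [1. 2. 3.]
        print(X.T[1])               # second coordinate: [10. 20. 30.]
        # so Rbf(*X.T, y) is exactly Rbf(X.T[0], X.T[1], y)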

    It also looks like the latest implementation has a mode parameter: with mode='N-D', the value array d can itself be multi-dimensional (shape (n_samples, m)), so multi-output targets can be passed directly.
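
    If that is right, a hedged sketch of the multi-output case would look like this (shapes and names are made up, and assume a scipy version whose Rbf accepts mode='N-D'):

        import numpy as np
        from scipy.interpolate import Rbf

        X = np.random.rand(20, 3)        # 20 samples, 3 coordinate axes
        Y = np.random.rand(20, 2)        # 2 target values per sample
        rbf = Rbf(*X.T, Y, mode='N-D')   # d may be (n_samples, m) in 'N-D' mode
        print(rbf(*X.T).shape)           # expected: (20, 2)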