Tags: python-3.x, scikit-learn, cross-validation

Python 3 and sklearn: difficulty using a non-sklearn model as a sklearn model


The code below works. It is just a routine that runs a cross-validation scheme using a linear model previously defined in sklearn, and I have no problem with that part. My problem is this: if I replace model=linear_model.LinearRegression() with model=RBF('multiquadric') (see the two model-definition lines in __main__ below), it no longer works. So my problem is actually in the class RBF, where I try to mimic a sklearn model.

With that replacement in place, I get the following error:

/home/daniel/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: All arrays must be equal length.

  FitFailedWarning)

1) Should I define a score function in the class RBF?

2) How do I do that? I am lost. Since I inherit from BaseEstimator and RegressorMixin, I expected this to be handled internally.

3) Is there something else missing?
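
For reference, my understanding is that RegressorMixin already supplies a default score method (R² computed from predict), so the scorer itself may not be the missing piece; what sklearn does expect is that fit returns self. A minimal sketch of a conforming regressor (the MeanRegressor below is made up purely to show the shape of the API):

    import numpy as np
    from sklearn.base import BaseEstimator, RegressorMixin

    class MeanRegressor(BaseEstimator, RegressorMixin):
        # Illustrative only: always predicts the mean of the training targets.
        def fit(self, X, y):
            self.mean_ = np.mean(y)   # learned attributes get a trailing underscore
            return self               # sklearn convention: fit returns self
        def predict(self, X):
            return np.full(len(X), self.mean_)

    # RegressorMixin provides score() for free, computed as R² from predict():
    # MeanRegressor().fit(X, y).score(X, y)  ->  0.0 on the training data

My full code follows.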

from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import Rbf
np.random.seed(0)
from sklearn.base import BaseEstimator, RegressorMixin



class RBF(BaseEstimator, RegressorMixin):
    def __init__(self,function):
        self.function=function
    def fit(self,x,y):
        self.rbf = Rbf(x, y,function=self.function)
    def predict(self,x):   
        return self.rbf(x)    


if __name__ == "__main__":
    # Load Data
    targetName='HousePrice'
    data=datasets.load_boston()
    featuresNames=list(data.feature_names)
    featuresData=data.data
    targetData = data.target
    df=pd.DataFrame(featuresData,columns=featuresNames)
    df[targetName]=targetData
    independent_variable_list=featuresNames
    dependent_variable=targetName
    X=df[independent_variable_list].values
    y=np.squeeze(df[[dependent_variable]].values)    
    # Model Definition    
    model=linear_model.LinearRegression()
    #model=RBF('multiquadric')    
    # Cross validation routine
    number_splits=5
    score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
    kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
    scalar = StandardScaler()
    pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
    results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
    for score in score_list:
        print(score+':')
        print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
        print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))

Solution

  • Let's look at the documentation for scipy.interpolate.Rbf:

    *args : arrays

    x, y, z, …, d, where x, y, z, … are the coordinates of the nodes and d is the array of values at the nodes

    So Rbf takes a variable-length argument list in which the last positional argument is the array of values, which is y in your case. The k-th positional argument holds the k-th coordinate of all the data points (and likewise for every other coordinate argument x, y, z, …).
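
    A quick sketch of that calling convention on a toy two-feature dataset (the arrays here are made up for illustration):

        import numpy as np
        from scipy.interpolate import Rbf

        X = np.random.rand(10, 2)   # 10 samples, 2 coordinate axes
        y = np.random.rand(10)

        # one positional array per coordinate axis, values array last
        rbf = Rbf(X[:, 0], X[:, 1], y)
        # the same call without spelling out every column:
        rbf = Rbf(*X.T, y)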

    Following the documentation, your code should be

    from sklearn import datasets
    import numpy as np
    import pandas as pd
    from sklearn import linear_model
    from sklearn import model_selection
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from scipy.interpolate import Rbf
    np.random.seed(0)
    from sklearn.base import BaseEstimator, RegressorMixin
    
    class RBF(BaseEstimator, RegressorMixin):
        def __init__(self, function):
            self.function = function

        def fit(self, X, y):
            # Rbf expects one 1-D array per coordinate axis followed by the
            # values array: Rbf(x0, x1, ..., y). *X.T unpacks the feature
            # columns into exactly that form.
            self.rbf = Rbf(*X.T, y, function=self.function)
            return self  # sklearn convention: fit returns the estimator

        def predict(self, X):
            return self.rbf(*X.T)
    
    
    # Load Data
    data=datasets.load_boston()
    
    X = data.data
    y = data.target
    
    
    number_splits=5
    score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
    
    kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
    scalar = StandardScaler()
    
    model = RBF(function='multiquadric')
    
    pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
    
    results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
    
    for score in score_list:
        print(score+':')
        print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
        print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))
    

    Output

    neg_mean_squared_error:
    Train: Mean -1.552450953914355e-20 Standard Error 7.932530906290208e-21
    Test: Mean -23.007377210596463 Standard Error 4.254629143836107
    neg_mean_absolute_error:
    Train: Mean -9.398502208736061e-11 Standard Error 2.4673749061941226e-11
    Test: Mean -3.1319779583728673 Standard Error 0.2162343985534446
    r2:
    Train: Mean 1.0 Standard Error 0.0
    Test: Mean 0.7144217179633185 Standard Error 0.08526294242760363
    

    Why *X.T: as we saw, each positional argument corresponds to one coordinate axis of all the data points. X has shape (n_samples, n_features), so X.T has one row per feature axis, and the * operator expands those rows so that each one is passed as a separate argument to the variable-length function.
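
    Concretely, a tiny throwaway illustration of the transpose-and-unpack step:

        import numpy as np

        X = np.array([[1., 10.],
                      [2., 20.],
                      [3., 30.]])   # 3 samples, 2 features
        print(X.T[0])               # first coordinate of every sample: [1. 2. 3.]
        print(X.T[1])               # second coordinate: [10. 20. 30.]
        # so Rbf(*X.T, y) is exactly Rbf(X.T[0], X.T[1], y)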

    It also looks like the latest implementation has a mode parameter: with mode='N-D', the value array d can itself be multi-dimensional (shape (n_samples, m)), so multi-output targets can be passed directly.
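
    If that is right, a hedged sketch of the multi-output case would look like this (shapes and names are made up, and assume a scipy version whose Rbf accepts mode='N-D'):

        import numpy as np
        from scipy.interpolate import Rbf

        X = np.random.rand(20, 3)        # 20 samples, 3 coordinate axes
        Y = np.random.rand(20, 2)        # 2 target values per sample
        rbf = Rbf(*X.T, Y, mode='N-D')   # d may be (n_samples, m) in 'N-D' mode
        print(rbf(*X.T).shape)           # expected: (20, 2)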