python machine-learning scikit-learn pipeline grid-search

Pipeline predict X has a different shape than during fitting

I am stuck on an error whose meaning I understand, but I don't know how to deal with it.

Here is what I do:

class PreProcessing(BaseEstimator, TransformerMixin):
  def __init__(self):
    pass

  def transform(self, df):
    # Here I select the features and transform them, for example:
    age_band = 0
    if age <= 10:
      age_band = 1
    else:  # ... etc. up to 90
      age_band = 9
    # ....
    # other feature engineering
    # ....
    encoder = ce.BinaryEncoder(cols=selectedCols)
    encoder.fit(df)
    df = encoder.transform(df)

    return df.values  # as_matrix() was removed in pandas 1.0

  def fit(self, df, y=None, **fit_params):

    return self

pipe = make_pipeline(PreProcessing(),
                     SelectKBest(f_classif, k=23),
                     RandomForestClassifier())

param_grid = {"randomforestclassifier__n_estimators" : [100,400],
              "randomforestclassifier__max_depth" : [None],
              "randomforestclassifier__max_leaf_nodes": [2,3,5], 
              "randomforestclassifier__min_samples_leaf":[3,5,8],
              "randomforestclassifier__class_weight":['balanced'],
              "randomforestclassifier__n_jobs":[-1]
             }

grid_search = GridSearchCV(pipe,param_grid,cv=5,scoring='recall',verbose=1,n_jobs=15)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

grid_search.fit(X_train,y_train)
grid_search.predict(X_test)

filename = 'myModel.pk'
with open(filename, 'wb') as file:
    pickle.dump(grid_search, file)

So far everything works like a charm. But with real-world data (not the train/test files):

modelfile = 'myModel.pk'
with open(modelfile,'rb') as f:
    loaded_model = pickle.load(f)

print("The model has been loaded...doing predictions now...")
predictions = loaded_model.predict(df)

I got the error: ValueError: X has a different shape than during fitting.

My understanding is that not all categories are represented in my "real" file. Imagine that in my training file the column "couple" has the values "yes", "no" and "I don't know"; ce.BinaryEncoder will then create as many columns as it needs to store all of those categories in binary. But in the real-life file that I have to make predictions on, the column "couple" only contains "yes" and "no". So in the end, X doesn't have the same shape as during the fit. The only fix I can think of is to create, in PreProcessing, all the missing categories as columns filled with 0.

I think I'm missing something.

Note: the training and test files come from one data source. The data that I need to predict comes from another source, so I first transform the real data into the same format as X_train/X_test, and then I call model.predict(df). So I am sure that before the BinaryEncoder I have the same number of columns (17) in PreProcessing.transform(). But if I log the shape of df after the BinaryEncoder has run, it shows 41 columns during model.predict(X_test) and only 31 columns during model.predict(realData).

Solution

This seems to be a problem with your "feature selection/creation" process. You're re-fitting a BinaryEncoder each time a new set of inputs gets passed to your pipeline. This means that any time you have a different number of unique values in the specified column, your code will break with this error.
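To see why re-fitting changes the width, here is a minimal pure-Python sketch of a binary encoder. It is a simplification and not the category_encoders implementation: the class name TinyBinaryEncoder, the code assignment (1..n, with 0 reserved for unseen values), and the extra "n/a" value in the sample data are all assumptions made for illustration.

```python
class TinyBinaryEncoder:
    """Toy binary encoder (hypothetical, for illustration only)."""

    def fit(self, values):
        # Assign codes 1..n in order of first appearance; 0 is reserved for unseen values.
        self.codes_ = {}
        for v in values:
            if v not in self.codes_:
                self.codes_[v] = len(self.codes_) + 1
        # The number of bits needed for the largest code decides the output width.
        self.width_ = max(1, len(self.codes_).bit_length())
        return self

    def transform(self, values):
        rows = []
        for v in values:
            code = self.codes_.get(v, 0)  # unseen value -> all zeros
            bits = [(code >> i) & 1 for i in reversed(range(self.width_))]
            rows.append(bits)
        return rows


train = ["yes", "no", "I don't know", "n/a", "yes"]
real = ["yes", "no", "no"]

# Re-fitting on each input: the width follows whatever data was passed in.
refit_train = TinyBinaryEncoder().fit(train).transform(train)
refit_real = TinyBinaryEncoder().fit(real).transform(real)
print(len(refit_train[0]), len(refit_real[0]))  # 3 2

# Fitting once and reusing the encoder: the width stays fixed.
encoder = TinyBinaryEncoder().fit(train)
print(len(encoder.transform(train)[0]), len(encoder.transform(real)[0]))  # 3 3
```

The first pair of widths differs because each encoder only knows the categories it was fitted on; the second pair matches because the mapping learned from the training data is reused.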

My guess is that if you save the BinaryEncoder as part of the PreProcessing instance, this won't be an issue assuming that your training data has every possible value that this column can take on.

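import category_encoders as ce
from sklearn.base import BaseEstimator, TransformerMixin

class PreProcessing(BaseEstimator, TransformerMixin):
  def __init__(self):
    self.encoder = ce.BinaryEncoder(cols=selectedCols)

  def fit(self, df, y=None, **kwargs):
    # Learn the category-to-column mapping once, from the training data
    self.encoder.fit(df)
    return self  # the Pipeline chains fit(...).transform(...)

  def transform(self, df):
    # ...
    # No fitting, just transform
    df = self.encoder.transform(df)
    return df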
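As a runnable illustration of the same idea, here is a sketch that keeps the fitted encoder on the transformer. It swaps in scikit-learn's OneHotEncoder for ce.BinaryEncoder so that it only depends on scikit-learn, pandas and NumPy; the class name StoredEncoderPreProcessing, the cols parameter, and the tiny data frames are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder


class StoredEncoderPreProcessing(BaseEstimator, TransformerMixin):
    """Hypothetical variant of PreProcessing that keeps its fitted encoder."""

    def __init__(self, cols=None):
        self.cols = cols  # names of the columns to encode (assumed parameter)

    def fit(self, df, y=None):
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
        self.encoder_ = OneHotEncoder(handle_unknown="ignore")
        self.encoder_.fit(df[self.cols])
        return self

    def transform(self, df):
        # No fitting here: reuse the mapping learned in fit.
        encoded = self.encoder_.transform(df[self.cols]).toarray()
        rest = df.drop(columns=self.cols).to_numpy(dtype=float)
        return np.hstack([rest, encoded])


train_df = pd.DataFrame({"age": [25, 40, 61, 33],
                         "couple": ["yes", "no", "I don't know", "yes"]})
real_df = pd.DataFrame({"age": [52, 19], "couple": ["yes", "no"]})

prep = StoredEncoderPreProcessing(cols=["couple"]).fit(train_df)
print(prep.transform(train_df).shape)  # (4, 4)
print(prep.transform(real_df).shape)   # (2, 4)
```

Both calls produce four columns (one for age, three for the categories seen during fit), even though the second data frame never contains "I don't know".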

Better yet, could you just insert the BinaryEncoder into your Pipeline, and leave it out of PreProcessing entirely?

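# PreProcessing.transform must then return a DataFrame (not a NumPy array),
# so that the encoder can still find selectedCols by name.
pipe = make_pipeline(PreProcessing(),
                     ce.BinaryEncoder(cols=selectedCols),
                     SelectKBest(f_classif, k=23),
                     RandomForestClassifier())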
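A small end-to-end sketch of the "encoder as a pipeline step" approach, including the pickle round trip from the question. It again uses scikit-learn's OneHotEncoder inside a ColumnTransformer as a stand-in for ce.BinaryEncoder, and the data, labels and hyperparameters are made up for the example.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data (assumed): one numeric and one categorical column.
train_df = pd.DataFrame({
    "age": [25, 40, 61, 33, 47, 58],
    "couple": ["yes", "no", "I don't know", "yes", "no", "I don't know"],
})
y_train = [0, 1, 1, 0, 1, 0]

# The encoder is a pipeline step, so it is fitted once together with the model.
encode = ColumnTransformer(
    [("couple", OneHotEncoder(handle_unknown="ignore"), ["couple"])],
    remainder="passthrough",
)
pipe = make_pipeline(encode, RandomForestClassifier(n_estimators=10, random_state=0))
pipe.fit(train_df, y_train)

# Round trip through pickle: the fitted encoder travels with the model.
loaded_model = pickle.loads(pickle.dumps(pipe))

# New data that lacks the "I don't know" category still has the expected shape.
real_df = pd.DataFrame({"age": [52, 19], "couple": ["yes", "no"]})
predictions = loaded_model.predict(real_df)
print(predictions.shape)  # (2,)
```

Because the whole pipeline is pickled, the category mapping learned during training is restored along with the classifier, and predict never re-fits anything.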