How to define a pipeline in Python with three steps: preprocessing, prediction and postprocessing?


I am trying to define a pipeline in Python using sklearn.pipeline.Pipeline to perform three steps: pre-processing, prediction and post-processing. The ultimate goal is to define a Google Cloud Function where I just pass the joblib model and get the predicted label and the predicted probability for this label.

I succeeded in defining the pipeline with the first two steps, and it works fine. However, when I try to include the third (post-processing) step, I get error messages. I have tried various approaches and get a different error message each time. In the following code, if I remove ('proba', FunctionTransformer(findProba())) from the pipeline, everything works fine. I can't figure out how to include the post-processing step in my pipeline.

Scikit-learn defines the pipeline class (see https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) as:

Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using memory argument.

Reading this definition, I started wondering whether it is possible at all to include a step after the estimator. But in my case, I really need to be able to return the class (konto in my case) and the probability of that class (proba). If I stop after the second step, I won't be able to compute and return the probability during online prediction.

I include a summary of the code to show what I am doing:

from nltk import word_tokenize
from nltk.corpus import stopwords
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.naive_bayes import MultinomialNB
from datetime import date
import time 

import warnings
warnings.filterwarnings('ignore')


def findProba(model,Input_Text):
    Input_Text = [Input_Text]
    Y_predicted = model.predict(Input_Text)
    Y_predict_proba = model.predict_proba(Input_Text)
    max_proba_rows = np.amax(Y_predict_proba, axis=1)*100
    round_off_proba = np.around(max_proba_rows, decimals = 1)
    d = dict()
    d['Konto'] = Y_predicted[0]
    d['proba'] = round_off_proba[0]
    return d


df_total = pd.read_csv('dataset_mars2019_trimmed_mapped.csv')
df=df_total.sample(frac=0.001, random_state=1)

X_train, X_test, y_train, y_test = train_test_split(df['Input_Data'], df['LABEL'], random_state = 0, test_size=0.25)

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
                     # post-processing step: removing this line makes everything work
                     ('proba', FunctionTransformer(findProba())),
])

_ = text_clf.fit(X_train, y_train)


import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(text_clf, 'model.joblib')

Solution

  • The semantics of sklearn.pipeline.Pipeline are as follows: a sequence of transformers (i.e. objects implementing fit and transform) followed by a final predictor (i.e. an object implementing fit and predict, and optionally predict_proba, decision_function, etc.).

    Since scikit-learn metrics expect the output of either predict or predict_proba, and a pipeline cannot run a step after its final estimator, what you want is not straightforward.

    I think the easiest way is to implement your own meta-estimator that does what you want:

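    import numpy as np
    from sklearn.base import BaseEstimator

    class PostProcessor(BaseEstimator):
        """Wrap a predictor and post-process its predictions."""
        def __init__(self, predictor):
            self.predictor = predictor
        def fit(self, X, y):
            self.predictor.fit(X, y)
            return self  # scikit-learn convention: fit returns the estimator
        def predict(self, X):
            y_pred = self.predictor.predict(X)
            y_pred_proba = self.predictor.predict_proba(X)
            # do something with those; column_stack places the 1-D labels
            # next to the 2-D probability matrix (hstack fails on mixed dims)
            return np.column_stack([y_pred, y_pred_proba])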
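
As a rough illustration of how such a meta-estimator could serve as the final pipeline step and return the label together with its probability, here is a minimal sketch. The class name LabelWithProba, the toy texts and the labels are all made up for this example; the dictionary keys Konto and proba mirror the findProba function from the question. This is one possible arrangement under those assumptions, not the only way to do it.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class LabelWithProba(BaseEstimator):
    """Hypothetical wrapper: predict() returns one dict per input sample."""

    def __init__(self, predictor):
        self.predictor = predictor

    def fit(self, X, y):
        self.predictor.fit(X, y)
        return self

    def predict(self, X):
        proba = self.predictor.predict_proba(X)
        best = np.argmax(proba, axis=1)  # column index of the most likely class
        # tolist() converts NumPy scalars to plain Python values
        labels = self.predictor.classes_[best].tolist()
        # probability of the winning class, as a percentage with one decimal
        percent = np.around(proba[np.arange(len(best)), best] * 100, decimals=1)
        return [{"Konto": label, "proba": float(p)}
                for label, p in zip(labels, percent)]


# Toy training data, invented purely for illustration
texts = ["office rent january", "rent for office space",
         "train ticket to berlin", "taxi and train travel"]
labels = ["rent", "rent", "travel", "travel"]

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LabelWithProba(MultinomialNB())),  # wrapper is the final estimator
])
text_clf.fit(texts, labels)

result = text_clf.predict(["rent for the new office"])
print(result)
```

Because the wrapper is the last step, text_clf.predict runs the vectorizer first and then hands the features to LabelWithProba.predict, so a single joblib file carries all three stages. Note that the output is no longer a plain label array, so scikit-learn scoring utilities will likely not accept it directly.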