I am trying to define a pipeline in Python using sklearn.pipeline.Pipeline to perform three steps: pre-processing, prediction and post-processing. The ultimate goal is to define a Google Cloud Function where I just pass the joblib model and get the predicted label and the predicted probability for that label.
I succeeded in defining the pipeline with the first two steps and it works fine. However, when I try to include the third (post-processing) step, I get error messages. I have tried various approaches and get different error messages each time.
In the following code, if I remove ('proba', FunctionTransformer(findProba)) from the pipeline, everything works fine. I can't seem to figure out how I could include the post-processing step in my pipeline.
Scikit-learn defines the Pipeline class (see https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) as:
Pipeline of transforms with a final estimator.
Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using memory argument.
Reading this definition, I started wondering whether it is even possible to include a step after the estimator. But in my case, I really need to be able to return both the class (Konto in my case) and the probability of that class (proba). If I stop after the second step, I won't be able to compute and return the probability during online prediction.
I include a summary of the code to show what I am doing:
from nltk import word_tokenize
from nltk.corpus import stopwords
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer  # needed for the 'proba' step below
from sklearn.naive_bayes import MultinomialNB
from datetime import date
import time
import warnings
warnings.filterwarnings('ignore')
def findProba(model, Input_Text):
    # wrap the single input string in a list so the pipeline accepts it
    Input_Text = [Input_Text]
    Y_predicted = model.predict(Input_Text)
    Y_predict_proba = model.predict_proba(Input_Text)
    # keep only the highest class probability, as a percentage rounded to 1 decimal
    max_proba_rows = np.amax(Y_predict_proba, axis=1) * 100
    round_off_proba = np.around(max_proba_rows, decimals=1)
    d = dict()
    d['Konto'] = Y_predicted[0]
    d['proba'] = round_off_proba[0]
    return d
df_total = pd.read_csv('dataset_mars2019_trimmed_mapped.csv')
df = df_total.sample(frac=0.001, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(df['Input_Data'], df['LABEL'], random_state=0, test_size=0.25)
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
                     ('proba', FunctionTransformer(findProba)),
                    ])
_ = text_clf.fit(X_train, y_train)
import joblib  # sklearn.externals.joblib is deprecated; use the standalone joblib package
joblib.dump(text_clf, 'model.joblib')
The semantics of sklearn.pipeline.Pipeline are the following: a sequence of transformers (i.e. objects implementing fit and transform) followed by a final predictor (i.e. an object implementing fit and predict, and optionally predict_proba, decision_function, etc.). There is no slot for a step after the final estimator. Since all scikit-learn metrics expect either predict or predict_proba output, it will not be straightforward to do what you want.
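As a quick illustration (toy data, purely hypothetical), note that a two-step pipeline already delegates predict and predict_proba to the final estimator, so you can get probabilities without any third step:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# tiny made-up corpus, just to show the delegation
X = ["cheap flights to paris", "invoice for march", "meeting at noon", "cheap hotel deals"]
y = ["spam", "ham", "ham", "spam"]

clf = Pipeline([('tfidf', TfidfVectorizer()),
                ('clf', MultinomialNB())])
clf.fit(X, y)

# Pipeline forwards these calls to the final estimator
print(clf.predict(["cheap paris hotel"]))        # predicted label
print(clf.predict_proba(["cheap paris hotel"]))  # per-class probabilities

What a Pipeline cannot do is run a transformer on that output, which is why the 'proba' step fails.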
I think the easiest way is to implement your own meta-estimator that does what you want:
import numpy as np
from sklearn.base import BaseEstimator

class PostProcessor(BaseEstimator):
    def __init__(self, predictor):
        self.predictor = predictor

    def fit(self, X, y):
        self.predictor.fit(X, y)
        return self

    def predict(self, X):
        y_pred = self.predictor.predict(X)
        y_pred_proba = self.predictor.predict_proba(X)
        # do something with those, e.g. return the label alongside the probabilities;
        # column_stack turns the 1-D y_pred into a column so the shapes line up
        return np.column_stack([y_pred, y_pred_proba])
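A sketch of how this meta-estimator could be used with your pipeline and saved for the Cloud Function (X_train, y_train and X_test are assumed to be the splits from your code above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import joblib

# wrap the two-step pipeline; no 'proba' step needed anymore
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB())])
model = PostProcessor(text_clf)
model.fit(X_train, y_train)

# each row: predicted label followed by the per-class probabilities
out = model.predict(X_test)

# the meta-estimator serializes like any other estimator
joblib.dump(model, 'model.joblib')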