python machine-learning scikit-learn pipeline feature-extraction

Include feature extraction in pipeline sklearn


For a text classification project I built a pipeline containing the feature selection step and the classifier. Is it possible to include the feature extraction step in the pipeline as well, and if so, how? I looked into it, but the examples I found don't seem to fit my current code.

This is what I have now:

# feature_extraction module.  
from sklearn.preprocessing import LabelEncoder, StandardScaler 
from sklearn.feature_extraction import DictVectorizer  
import numpy as np

vec = DictVectorizer() 
X = vec.fit_transform(instances)
# with_mean=False because X is sparse; we use cross validation, so there is no train/test set
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)  # to make sure everything is on the same scale

enc = LabelEncoder()
y = enc.fit_transform(labels)

# Feature selection and classification pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn.pipeline import Pipeline

feat_sel = SelectKBest(mutual_info_classif, k=200)  
clf = linear_model.LogisticRegression() 
pipe = Pipeline([('mutual_info', feat_sel), ('logistregress', clf)])
y_pred = model_selection.cross_val_predict(pipe, X_scaled, y, cv=10)

How can I move everything from the DictVectorizer up to the LabelEncoder into the pipeline?


Solution

  • Here's how you would do it. Assuming instances is a list of dict-like objects (one feature mapping per sample), as the DictVectorizer API expects, just build your pipeline like so:

    pipe = Pipeline([('vectorizer', DictVectorizer()),
                     ('scaler', StandardScaler(with_mean=False)),
                     ('mutual_info', feat_sel),
                     ('logistregress', clf)])
    

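    The LabelEncoder stays outside the pipeline, because pipeline steps only transform X, so y is encoded beforehand as in your code. To predict, call cross_val_predict, passing the raw instances as X: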

    y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)
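
For reference, here is a minimal end-to-end sketch of the pipeline above. The toy instances and labels, the feature names (n_good, n_bad, length), and the small values k=2 and cv=3 are all made up for illustration and are not from the question; with real data you would keep your own k=200 and cv=10.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data (assumption): one feature dict per document.
instances = [
    {"n_good": 3, "n_bad": 0, "length": 12},
    {"n_good": 2, "n_bad": 0, "length": 9},
    {"n_good": 4, "n_bad": 1, "length": 15},
    {"n_good": 3, "n_bad": 1, "length": 11},
    {"n_good": 2, "n_bad": 0, "length": 10},
    {"n_good": 5, "n_bad": 0, "length": 14},
    {"n_good": 0, "n_bad": 3, "length": 13},
    {"n_good": 1, "n_bad": 2, "length": 9},
    {"n_good": 0, "n_bad": 4, "length": 16},
    {"n_good": 0, "n_bad": 2, "length": 10},
    {"n_good": 1, "n_bad": 3, "length": 12},
    {"n_good": 0, "n_bad": 5, "length": 11},
]
labels = ["pos"] * 6 + ["neg"] * 6

# The target is encoded outside the pipeline: pipeline steps only transform X.
enc = LabelEncoder()
y = enc.fit_transform(labels)

# Same step order as in the answer; k is reduced to fit the three toy features.
pipe = Pipeline([
    ("vectorizer", DictVectorizer()),
    ("scaler", StandardScaler(with_mean=False)),  # keeps the sparse matrix sparse
    ("mutual_info", SelectKBest(mutual_info_classif, k=2)),
    ("logistregress", LogisticRegression()),
])

# Every step is refit on the training part of each fold,
# so nothing is learned from the held-out samples.
y_pred = cross_val_predict(pipe, instances, y, cv=3)

# Map the integer predictions back to the original label strings.
print(enc.inverse_transform(y_pred))
```

Because the vectorizer and scaler are now inside the pipeline, they are fitted per fold rather than once on the full data set, which is the main practical difference from the code in the question.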