Scikit-learn Pipeline - Execution order of transformers


I am working on a text classifier for which I want to do the following

  1. Create new features from the text (number of words, number of hashtags, etc.) with a custom transformer TextCounts
  2. Clean the text with a custom transformer CleanText and apply CountVectorizer on it
  3. Combine the features of step 1 and 2 as input for my classifier

I managed to create a Pipeline for this, but I am not sure whether it runs as described above.

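from sklearn.pipeline import FeatureUnion, Pipeline

# The transformer list goes first and every entry must be a (name, transformer) tuple.
# 'cleanvect' is an arbitrary name for the nested pipeline.
features = FeatureUnion(
    [('textcounts', TextCounts())
    , ('cleanvect', Pipeline([
        ('cleantext', CleanText())
        , ('vect', vect)
        ]))
    ]
    , n_jobs=-1)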

pipeline = Pipeline([
    ('features', features)
    , ('clf', clf)
])

In particular, I am not sure whether the CountVectorizer is applied to the cleaned text or to the original text. Is there a way to figure that out? Thanks!


Solution

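  • The branches of the FeatureUnion are applied in parallel: each one receives the same original input, and because n_jobs=-1 allows as many jobs as you have cores, they really do run concurrently. The steps inside a Pipeline, by contrast, run sequentially, with each step consuming the output of the previous one. So yes, the CountVectorizer is applied to the cleaned text, because it comes after CleanText in the nested Pipeline.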

    I think the graphics in this blog post make it quite clear.

    Regarding "Is there a way to figure that out?", see my answer here.
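
    One way to check this yourself is to fit the FeatureUnion on a couple of toy documents and inspect the vocabulary that the nested CountVectorizer learned. The sketch below is only an illustration: the TextCounts and CleanText classes are hypothetical stand-ins (the real ones are not shown in the question), the branch name 'cleanvect' is arbitrary, and n_jobs is left at its default to keep the example light.

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class TextCounts(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: word count and '#' count per document."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(doc.split()), doc.count("#")] for doc in X])


class CleanText(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: lowercases and drops hashtags and @mentions."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [re.sub(r"[#@]\w+", " ", doc.lower()) for doc in X]


docs = ["Loving the new release #python", "@bob the build is broken #fail"]

features = FeatureUnion(
    [
        ("textcounts", TextCounts()),
        ("cleanvect", Pipeline([("cleantext", CleanText()),
                                ("vect", CountVectorizer())])),
    ]
)
combined = features.fit_transform(docs)

# After fitting, look up the nested Pipeline by name and read the
# vocabulary learned by its CountVectorizer step.
fitted_branch = dict(features.transformer_list)["cleanvect"]
vocabulary = set(fitted_branch.named_steps["vect"].vocabulary_)

# For comparison: a vectorizer fitted directly on the raw text.
raw_vocabulary = set(CountVectorizer().fit(docs).vocabulary_)

# Tokens that only exist in the raw text never reached the nested vectorizer.
print(sorted(raw_vocabulary - vocabulary))  # ['bob', 'fail', 'python']
print(combined.shape)  # (2, 9): 2 count features + 7 vocabulary terms
```

    The hashtag and mention tokens are missing from the nested vectorizer's vocabulary, so it saw the output of CleanText. The '#' counts produced by TextCounts are still non-zero, which shows that the other branch received the original text.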