I am working on a text classifier for which I want to do the following
I managed to create a Pipeline for this, but I am not sure whether it runs as described above.
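from sklearn.pipeline import FeatureUnion, Pipeline

# TextCounts, CleanText, vect (a CountVectorizer) and clf are defined elsewhere.
features = FeatureUnion(
    [('textcounts', TextCounts()),
     # Each entry must be a (name, transformer) tuple; the name 'pipe' is arbitrary.
     ('pipe', Pipeline([
         ('cleantext', CleanText()),
         ('vect', vect),
     ]))],
    n_jobs=-1)  # keyword arguments must come after the positional list
pipeline = Pipeline([
    ('features', features),
    ('clf', clf),
])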
In fact, I am not sure whether the CountVectorizer is being applied on the cleaned text or the original text. Is there a way to figure that out? Thanks!
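The transformers within the FeatureUnion are applied side by side, each receiving the same original input (and since n_jobs=-1 allows as many jobs as you have cores, they actually run in parallel). The steps within the nested Pipeline, on the other hand, run in sequence, with each step receiving the output of the previous one. So yes, the CountVectorizer will be applied to the cleaned text, while TextCounts sees the original text.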
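One way to check this yourself is to fit the union on a couple of toy documents and look at what each branch ended up with. The sketch below is only an illustration: the CleanText and TextCounts classes here are hypothetical stand-ins (your real ones will differ), the sample documents are made up, and n_jobs is left out to keep it simple. The stand-in cleaner removes @mentions, so if the vectorizer had seen the raw text, "alice" and "bob" would show up in its vocabulary.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class CleanText(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: strips @mentions from each document."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [re.sub(r"@\w+", "", doc) for doc in X]


class TextCounts(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: counts '!' and records the input it was fitted on."""

    def fit(self, X, y=None):
        self.seen_ = list(X)  # kept only so we can inspect it afterwards
        return self

    def transform(self, X):
        return [[doc.count("!")] for doc in X]


docs = ["@alice great flight!!", "@bob terrible delay!"]

features = FeatureUnion([
    ("textcounts", TextCounts()),
    ("pipe", Pipeline([
        ("cleantext", CleanText()),
        ("vect", CountVectorizer()),
    ])),
])
features.fit(docs)

# Look up the fitted transformers by the names given above.
fitted = dict(features.transformer_list)
counts = fitted["textcounts"]
vect = fitted["pipe"].named_steps["vect"]

print(counts.seen_)             # the raw documents, mentions included
print(sorted(vect.vocabulary_))  # ['delay', 'flight', 'great', 'terrible']
```

The vocabulary contains no "alice" or "bob", so the vectorizer was fitted on the output of CleanText, while TextCounts was fitted on the untouched documents. The same lookup via transformer_list and named_steps should work on your own pipeline after fitting.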
I think the graphics in this blog post make it quite clear.
Regarding "Is there a way to figure that out?", see my answer here for further details.