I have a machine-learning classification task that trains on the concatenation of several fixed-length vector representations. How can I use automatic feature selection, grid search, or any other established technique in scikit-learn to find the best combination of transformers for my data?
Take this text classification flow as an example:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

model = Pipeline([
    ('vectorizer', FeatureUnion(transformer_list=[
        ('word-freq', TfidfVectorizer()),         # vocab-size dimensional
        ('doc2vec', MyDoc2VecVectorizer()),       # 32 dimensional (custom transformer)
        ('doc-length', MyDocLengthVectorizer()),  # 1 dimensional (custom transformer)
        ('sentiment', MySentimentVectorizer()),   # 3 dimensional (custom transformer)
        ...                                       # possibly many other transformers
    ])),
    ('classifier', SVC())
])
I suspect this may fall under the dynamic-pipeline functionality requested in scikit-learn's SLEP002. If so, how can it be handled in the interim?
While this does not quite let us "choose the best (all-or-nothing) subset of transformers", we can use scikit-learn's feature selection or dimensionality reduction modules to "choose/simplify the best feature subset across ALL transformers" as an extra step before classification:
from sklearn.feature_selection import GenericUnivariateSelect

model = Pipeline([
    ('vectorizer', FeatureUnion(transformer_list=[...])),
    ('feature_selector', GenericUnivariateSelect(
        mode='percentile',
        param=20,  # percentage of features to keep (0-100); hyper-tunable parameter
    )),
    ('classifier', SVC())
])
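Since the selector's threshold is an ordinary pipeline hyperparameter, it can be tuned with GridSearchCV like any other. A minimal sketch, assuming the pipeline above and raw training data X, y (illustrative names, not from the question):

from sklearn.model_selection import GridSearchCV

# Tune how aggressively the concatenated feature space is pruned.
# 'feature_selector' is the step name from the pipeline above.
param_grid = {
    'feature_selector__param': [10, 20, 50, 100],  # percentile of features kept
}

search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)  # X: raw documents, y: class labels
print(search.best_params_)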
In a feature discovery context (i.e., finding the most expressive signals), this technique is more powerful than cherry-picking transformers. However, in an architecture discovery context (i.e., finding the optimal pipeline layout and use of transformers), the problem seems to remain open.
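That said, scikit-learn does offer one partial interim workaround for the all-or-nothing case: any entry in a FeatureUnion can be replaced by the string 'drop' via its parameter name, so whole transformers can be toggled on and off inside a grid search. A hedged sketch, reusing the question's custom transformer classes (which are placeholders):

from sklearn.model_selection import GridSearchCV

# Each candidate value is either the transformer itself or 'drop',
# so the grid search evaluates transformer subsets directly.
param_grid = {
    'vectorizer__doc2vec': [MyDoc2VecVectorizer(), 'drop'],
    'vectorizer__sentiment': [MySentimentVectorizer(), 'drop'],
}

search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)

Note that this explodes combinatorially (2^k fits for k toggled transformers) and cannot rearrange the pipeline layout itself, so it is a stopgap rather than a solution to the architecture-discovery problem.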