machine-learning, scikit-learn, feature-selection, grid-search

How to feature-select scikit transformers within FeatureUnion


I have a machine-learning classification task that trains on the concatenation of various fixed-length vector representations. How can I use automatic feature selection, grid search, or any other established scikit-learn technique to find the best combination of transformers for my data?

Take this text classification flow as an example:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

model = Pipeline([
   ('vectorizer', FeatureUnion(transformer_list=[
      ('word-freq', TfidfVectorizer()),        # vocab-size dimensional
      ('doc2vec', MyDoc2VecVectorizer()),      # 32 dimensional (custom transformer)
      ('doc-length', MyDocLengthVectorizer()), # 1 dimensional (custom transformer)
      ('sentiment', MySentimentVectorizer()),  # 3 dimensional (custom transformer)
      ...                                      # possibly many other transformers
   ])),
   ('classifier', SVC())
])

I suspect this may fall under the dynamic-pipeline functionality requested in scikit-learn's SLEP002. If so, how can this be handled in the interim?


Solution

  • While we cannot quite "choose the best (all-or-nothing) subset of transformers", we can use scikit-learn's feature selection or dimensionality reduction modules to "choose/simplify the best feature subset across ALL transformers" as an extra step before classification:

    from sklearn.feature_selection import GenericUnivariateSelect

    model = Pipeline([
       ('vectorizer', FeatureUnion(transformer_list=[...])),
       ('feature_selector', GenericUnivariateSelect(
          mode='percentile',
          param=20,            # keep the top 20% of features (percent in (0, 100]); hyper-tunable
       )),
       ('classifier', SVC())
    ])
    

    In a feature discovery context (i.e. finding the most expressive signals), this technique is more powerful than cherry-picking transformers. However, in an architecture discovery context (i.e. finding the optimal pipeline layout and use of transformers), the problem seems to remain open. The selection percentile itself can be tuned automatically; see the grid-search sketch below.
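
    Below is a minimal sketch (not from the original answer) of how the selector's percentile could be tuned with GridSearchCV. It assumes the custom transformers from the question (MyDoc2VecVectorizer, etc.) exist as scikit-learn compatible transformers; they are commented out here so the snippet runs with TfidfVectorizer alone, and the grid values are illustrative only:

    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import GenericUnivariateSelect
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    model = Pipeline([
       ('vectorizer', FeatureUnion(transformer_list=[
          ('word-freq', TfidfVectorizer()),
          # ('doc2vec', MyDoc2VecVectorizer()),      # custom transformers assumed
          # ('doc-length', MyDocLengthVectorizer()), # from the question
          # ('sentiment', MySentimentVectorizer()),
       ])),
       ('feature_selector', GenericUnivariateSelect(mode='percentile')),
       ('classifier', SVC())
    ])

    param_grid = {
       # percent of features to keep after the FeatureUnion (values in (0, 100])
       'feature_selector__param': [5, 10, 20, 50],
       # classifier hyperparameters can be searched in the same grid
       'classifier__C': [0.1, 1, 10],
    }

    search = GridSearchCV(model, param_grid, cv=5, scoring='f1_macro', n_jobs=-1)
    # search.fit(texts, labels)    # texts: list of str, labels: array-like
    # print(search.best_params_)

    This tunes how aggressively the combined feature space is pruned, but it still treats the FeatureUnion's output as a single pool of features rather than selecting whole transformers on or off.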