I have a machine-learning classification task that trains on the concatenation of several fixed-length vector representations. How can I use automatic feature selection, grid search, or any other established technique in scikit-learn to find the best combination of transformers for my data?
Take this text classification flow as an example:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

model = Pipeline([
    ('vectorizer', FeatureUnion(transformer_list=[
        ('word-freq', TfidfVectorizer()),         # vocab-size dimensional
        ('doc2vec', MyDoc2VecVectorizer()),       # 32 dimensional (custom transformer)
        ('doc-length', MyDocLengthVectorizer()),  # 1 dimensional (custom transformer)
        ('sentiment', MySentimentVectorizer()),   # 3 dimensional (custom transformer)
        ...                                       # possibly many other transformers
    ])),
    ('classifier', SVC())
])
I suspect this may fall under the dynamic-pipeline functionality requested in scikit-learn's SLEP002. If so, how can it be handled in the interim?
While this does not quite let us "choose the best (all-or-nothing) subset of transformers", we can use scikit-learn's feature selection or dimensionality reduction modules to "choose/simplify the best feature subset across ALL transformers" as an extra step before classification:
from sklearn.feature_selection import GenericUnivariateSelect

model = Pipeline([
    ('vectorizer', FeatureUnion(transformer_list=[...])),
    ('feature_selector', GenericUnivariateSelect(
        mode='percentile',
        param=20,  # percentage of features to keep (0-100); hyper-tunable parameter
    )),
    ('classifier', SVC())
])
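Since the selector's threshold is an ordinary pipeline hyperparameter, it can be tuned with GridSearchCV like any other. A minimal sketch, assuming the pipeline above and raw training data X, y (illustrative names, not from the question):

from sklearn.model_selection import GridSearchCV

# Tune how aggressively the concatenated feature space is pruned.
# 'feature_selector' is the step name from the pipeline above.
param_grid = {
    'feature_selector__param': [10, 20, 50, 100],  # percentile of features kept
}

search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)  # X: raw documents, y: class labels
print(search.best_params_)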
In a feature discovery context (i.e., finding the most expressive signals), this technique is more powerful than cherry-picking transformers. However, in an architecture discovery context (i.e., finding the optimal pipeline layout and use of transformers), the problem seems to remain open.
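That said, scikit-learn does offer one partial interim workaround for the all-or-nothing case: any entry in a FeatureUnion can be replaced by the string 'drop' via its parameter name, so whole transformers can be toggled on and off inside a grid search. A hedged sketch, reusing the question's custom transformer classes (which are placeholders):

from sklearn.model_selection import GridSearchCV

# Each candidate value is either the transformer itself or 'drop',
# so the grid search evaluates transformer subsets directly.
param_grid = {
    'vectorizer__doc2vec': [MyDoc2VecVectorizer(), 'drop'],
    'vectorizer__sentiment': [MySentimentVectorizer(), 'drop'],
}

search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)

Note that this explodes combinatorially (2^k fits for k toggled transformers) and cannot rearrange the pipeline layout itself, so it is a stopgap rather than a solution to the architecture-discovery problem.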