python scikit-learn pipeline

Binding outputs of transformers in FeatureUnion


I'm new to Python and sklearn, so apologies in advance. I have two transformers and I would like to gather the results in a `FeatureUnion` (for a final modelling step at the end). This should be quite simple, but FeatureUnion is stacking the outputs end to end rather than providing an n x 2 array or DataFrame. In the example below I generate some data that is 10 rows by 2 columns. Each transformer then produces one feature that is 10 rows by 1 column. I would like the final feature union to have 10 rows and 2 columns, but what I get is a single array of length 20.

I will try to demonstrate with my example below:

Some imports:

import numpy as np
import pandas as pd
from sklearn import pipeline
from sklearn.base import TransformerMixin

Some random data:

df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])

A custom transformer that selects a column:

class Trans(TransformerMixin):
    def __init__(self, col_name):
        self.col_name = col_name
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.col_name]

A feature union that uses the transformer twice (in my real case I have two different transformers, but this reproduces the problem):

pipe = pipeline.FeatureUnion([
    ('select_a', Trans('a')),
    ('select_b', Trans('b'))
    ])

Now I use the feature union, but it returns an array of twice the length:

pipe.fit_transform(df).shape

(20,)

However, I would like an array with dimensions (10, 2).

Quick fix?


Solution

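  • The transformers in a FeatureUnion need to return 2-dimensional arrays. In your code, selecting a column with X[self.col_name] returns a 1-dimensional Series, so the two outputs are joined end to end into one array of length 20. You can fix this by selecting the column with X[[self.col_name]], which returns a one-column DataFrame of shape (10, 1).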
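
For reference, here is a minimal sketch of that fix applied to the example from the question. The class name ColumnSelector is just an illustrative rename of Trans, and inheriting from BaseEstimator as well as TransformerMixin is my own addition (it is the usual convention for custom estimators, not something the question requires). The only change that matters is the double brackets in transform.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Illustrative rename of Trans: selects one column, kept 2-dimensional."""

    def __init__(self, col_name):
        self.col_name = col_name

    def fit(self, X, y=None):
        # Nothing to learn; accept y so the class also works inside a Pipeline.
        return self

    def transform(self, X):
        # Double brackets return a one-column DataFrame of shape (n, 1),
        # whereas X[self.col_name] would return a 1-dimensional Series.
        return X[[self.col_name]]


df = pd.DataFrame(np.random.rand(10, 2), columns=["a", "b"])

union = FeatureUnion([
    ("select_a", ColumnSelector("a")),
    ("select_b", ColumnSelector("b")),
])

result = union.fit_transform(df)
print(result.shape)  # (10, 2)
```

With 2-dimensional outputs the features are placed side by side as columns, so each row of the result lines up with the corresponding row of df.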