python-2.7, machine-learning, scikit-learn, pipeline, feature-extraction

How to make FeatureUnion return Dataframe


I currently have a Pipeline made up of several custom transformers:

from sklearn.pipeline import Pipeline

p = Pipeline([
    ("GetTimeFromDate", TimeTransformer("Date")),  # Custom transformer that adds a ["time"] column
    ("GetZipFromAddress", ZipTransformer("Address")),  # Custom transformer that adds a ["zip"] column
    ("GroupByTimeandZip", GroupByTransformer(["time", "zip"])),  # Custom transformer that adds one-hot columns
])

Each transformer takes in a pandas dataframe and returns the same dataframe with one or more new columns. It actually works quite well, but how can I run the "GetTimeFromDate" and the "GetZipFromAddress" steps in parallel?

I would like to use FeatureUnion:

from sklearn.pipeline import FeatureUnion

f = FeatureUnion([
    ("GetTimeFromDate", TimeTransformer("Date")),  # Custom transformer that adds a ["time"] column
    ("GetZipFromAddress", ZipTransformer("Address")),  # Custom transformer that adds a ["zip"] column
])

p = Pipeline([
    ("FeatureUnionStep", f),
    ("GroupByTimeandZip", GroupByTransformer(["time", "zip"])),  # Custom transformer that adds one-hot columns
])

The problem is that FeatureUnion returns a numpy.ndarray, whereas the "GroupByTimeandZip" step needs a DataFrame with named columns.
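A small reproduction of that behavior, using scikit-learn's default output configuration. The two FunctionTransformer steps and the column names "a" and "b" are placeholders chosen for illustration, not the transformers from the question:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Placeholder transformers; each one receives the same input frame.
union = FeatureUnion([
    ("double", FunctionTransformer(lambda X: X * 2)),
    ("square", FunctionTransformer(lambda X: X ** 2)),
])

out = union.fit_transform(df)

# The branch outputs are stacked horizontally into a plain array,
# so the column names are no longer available to the next step.
print(type(out), out.shape)
```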

Is there a way I can get FeatureUnion to return a pandas dataframe?


Solution

  • For a FeatureUnion to output a DataFrame you can use the PandasFeatureUnion from this blog post. Also see the gist.
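The linked PandasFeatureUnion is not reproduced here. As a rough sketch of the same idea, assuming each transformer returns a DataFrame that shares the input's index, the outputs can be joined column-wise with pd.concat instead of being stacked into an array. The names DataFrameUnion and AddColumn are hypothetical (AddColumn stands in for TimeTransformer and ZipTransformer), and this simplified version runs the transformers sequentially rather than in parallel:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AddColumn(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for TimeTransformer / ZipTransformer:
    returns a copy of the input frame plus one derived column."""

    def __init__(self, source, target, func):
        self.source = source
        self.target = target
        self.func = func

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = X.copy()
        out[self.target] = out[self.source].map(self.func)
        return out


class DataFrameUnion(BaseEstimator, TransformerMixin):
    """Simplified, sequential union that keeps pandas output."""

    def __init__(self, transformer_list):
        self.transformer_list = transformer_list

    def fit(self, X, y=None):
        for _, transformer in self.transformer_list:
            transformer.fit(X, y)
        return self

    def transform(self, X):
        frames = [t.transform(X) for _, t in self.transformer_list]
        combined = pd.concat(frames, axis=1)
        # Every transformer also returns the original columns, so keep
        # only the first occurrence of each column name.
        return combined.loc[:, ~combined.columns.duplicated()]


df = pd.DataFrame({
    "Date": ["2017-01-01 08:30", "2017-01-02 17:45"],
    "Address": ["1 Main St, 90210", "2 Oak Ave, 10001"],
})

union = DataFrameUnion([
    ("GetTimeFromDate", AddColumn("Date", "time", lambda s: s.split(" ")[1])),
    ("GetZipFromAddress", AddColumn("Address", "zip", lambda s: s[-5:])),
])

result = union.fit_transform(df)
print(type(result).__name__, list(result.columns))
```

Because the class exposes fit and transform, it could take the place of the FeatureUnion step in the Pipeline above, so that the "GroupByTimeandZip" step receives a DataFrame containing both the "time" and "zip" columns.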