FunctionTransformer & creating new columns in pipeline


I have a sample data:

df = pd.DataFrame(columns=['X1', 'X2', 'X3'], data=[
                                               [1,16,9],
                                               [4,36,16],
                                               [1,16,9],
                                               [2,9,8],
                                               [3,36,15],
                                               [2,49,16],
                                               [4,25,14],
                                               [5,36,17]])

I want to create two complementary columns in my df, based on X2 and X3, and include that step in the pipeline.

I am trying the following code:

def feat_comp(x):
 x1 = 100-x
 return x1

pipe_text = Pipeline([('col_test', FunctionTransformer(feat_comp, 'X2',validate=False))])
X = pipe_text.fit_transform(df)

It gives me an error:

TypeError: 'str' object is not callable

How can I apply the function transformer on selected columns and how can I use them in the pipeline?


Solution

  • If I understand you correctly, you want to add a new column based on a given column, e.g. X2. You need to pass this column as an additional argument to the function using kw_args:

    import pandas as pd
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.pipeline import Pipeline
    
    df = pd.DataFrame(columns=['X1', 'X2', 'X3'], data=[
                                                   [1,16,9],
                                                   [4,36,16],
                                                   [1,16,9],
                                                   [2,9,8],
                                                   [3,36,15],
                                                   [2,49,16],
                                                   [4,25,14],
                                                   [5,36,17]])
    
    def feat_comp(x, column):
       # Copy first so the DataFrame passed into the pipeline is not modified in place
       x = x.copy()
       x[f'100-{column}'] = 100 - x[column]
       return x
    
    pipe_text = Pipeline([('col_test', FunctionTransformer(feat_comp, validate=False, kw_args={'column': 'X2'}))])
    pipe_text.fit_transform(df)
    

    Result:

       X1  X2  X3  100-X2
    0   1  16   9      84
    1   4  36  16      64
    2   1  16   9      84
    3   2   9   8      91
    4   3  36  15      64
    5   2  49  16      51
    6   4  25  14      75
    7   5  36  17      64
    

    (In your example, FunctionTransformer(feat_comp, 'X2', validate=False) passes 'X2' as the second positional argument, which is inverse_func. The string 'X2' is not callable, hence the error.)
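
The question asks for two complementary columns (from X2 and X3), while the example above adds only one. Here is a minimal sketch of one way to extend the same kw_args approach to several columns. The helper name add_complements and the kw_args key columns are hypothetical and not part of the original answer, and the helper copies the input on the assumption that the caller's DataFrame should stay unchanged:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame(
    columns=['X1', 'X2', 'X3'],
    data=[[1, 16, 9], [4, 36, 16], [1, 16, 9], [2, 9, 8],
          [3, 36, 15], [2, 49, 16], [4, 25, 14], [5, 36, 17]],
)

def add_complements(x, columns):
    # Hypothetical helper: adds a '100-<col>' column for each name in `columns`.
    # Work on a copy so the caller's DataFrame is left untouched.
    out = x.copy()
    for column in columns:
        out[f'100-{column}'] = 100 - out[column]
    return out

pipe_multi = Pipeline([
    ('complements', FunctionTransformer(
        add_complements,
        validate=False,
        kw_args={'columns': ['X2', 'X3']},  # forwarded to add_complements
    )),
])

result = pipe_multi.fit_transform(df)
print(result)
```

Passing the column names through kw_args keeps the function reusable: the same step can be configured for other columns without changing its body.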