python numpy scikit-learn dill

Using numpy in sklearn FunctionTransformer inside pipeline


I'm training a regression model and inside my pipeline I have something like this:

best_pipeline = Pipeline(
    steps=[
        (
            "features",
            ColumnTransformer(
                transformers=[
                    (
                        "area",
                        make_pipeline(
                            impute.SimpleImputer(),
                            pr.FunctionTransformer(lambda x: np.log1p(x)),
                            StandardScaler(),
                        ),
                        ["area"],
                    )
                ]
            ),
        ),
        (
            "regressor",
            TransformedTargetRegressor(
                regressor=model,
                transformer=PowerTransformer(method='box-cox')
            ),
        ),
    ]
)

There are more features, but including them all would make the code too long. I train the model, and if I predict in the same script everything works. I then store the model using dill and try to use it in another Python file.

In this other file I load the model and try this:

import numpy as np
df['prediction'] = self.model.predict(df)

Internally, when it tries to apply the transform, it raises:

NameError: name 'np' is not defined

Solution

  • You can use third-party library functions by passing the function object itself as the func argument:

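    import numpy
    from sklearn.preprocessing import FunctionTransformer

    # Pass the NumPy function directly; no lambda wrapper is needed.
    transformer = FunctionTransformer(numpy.log1p)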
    

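    There is no need for lambdas or custom wrapper classes. Unlike a lambda, this transformer can also be persisted with plain pickle, because numpy.log1p is pickled by name and looked up in numpy again when the object is loaded.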

    When porting objects between different environments, it's probably a good idea to use canonical module names: hence numpy.log1p instead of np.log1p.
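
To make the persistence point concrete, here is a minimal sketch that applies the fix to a cut-down version of the "area" sub-pipeline from the question and round-trips it through plain pickle. The toy values in X and the variable names (area_pipeline, restored, lambda_step) are made up for illustration, and plain pickle stands in for the dill-based storage used in the question.

```python
import pickle

import numpy
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical stand-in for the "area" column, with one missing value.
X = numpy.array([[50.0], [80.0], [numpy.nan], [120.0]])

# Same three steps as in the question, but with numpy.log1p passed directly.
area_pipeline = make_pipeline(
    SimpleImputer(),
    FunctionTransformer(numpy.log1p),
    StandardScaler(),
)
area_pipeline.fit(X)
expected = area_pipeline.transform(X)

# Round-trip through plain pickle and check the output is unchanged.
restored = pickle.loads(pickle.dumps(area_pipeline))
roundtrip_ok = numpy.allclose(restored.transform(X), expected)

# For comparison: the lambda version cannot be serialized by plain pickle.
lambda_step = FunctionTransformer(lambda x: numpy.log1p(x))
try:
    pickle.dumps(lambda_step)
    lambda_pickles = True
except (pickle.PicklingError, AttributeError):
    lambda_pickles = False

print(roundtrip_ok, lambda_pickles)
```

The restored pipeline only needs numpy and scikit-learn to be importable in the loading environment; it does not depend on any alias such as np being defined in the file that calls predict.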