python scikit-learn label-encoding

Saved model (random forest) doesn't work like a "freshly fitted" model: problems with categorical variables


I built a random forest model in scikit-learn and saved it. Then I loaded the model and tried to apply it to the same data set that was used for training, and I got this error message:

"could not convert string to float"

The data set contains a couple of categorical variables. I was able to apply the model to this data set without errors before I saved it, so the problem seems to be that the information about these categorical variables was not saved along with the model. I encoded these variables with LabelEncoder. Is there any way to save the information about these categorical variables so that the saved model works as well as the "freshly fitted" one? Thanks in advance!


Solution

  • This is a typical use case for a Pipeline.

    Create your workflow as a single pipeline and then save the pipeline, so the fitted encoder is stored together with the model.

    When you load the pipeline, you can get predictions on new data directly, without encoding it yourself first.

    Also, LabelEncoder is not meant for transforming input data. As the name suggests, it is for the target variable.

    If you need to convert a categorical variable into ordinal numbers, use OrdinalEncoder.

    Toy example:

    import joblib
    from sklearn.compose import make_column_transformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OrdinalEncoder

    X = [[1, 'orange', 'yes'], [1, 'apple', 'yes'],
         [-1, 'orange', 'no'], [-1, 'apple', 'no']]
    y = [1, 1, 0, 0]  # 1-D target; a column vector triggers a DataConversionWarning

    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        random_state=42)

    # The column transformer applies OrdinalEncoder to the 2nd and 3rd columns
    # and passes the remaining column through unchanged.
    pipe = Pipeline(
        [('encoder', make_column_transformer((OrdinalEncoder(), [1, 2]),
                                             remainder='passthrough')),
         ('rf', RandomForestClassifier(n_estimators=2, random_state=42))])

    pipe.fit(X_train, y_train)

    # Save the whole pipeline (fitted encoder + model), then reload it.
    joblib.dump(pipe, 'pipe.pkl')

    loaded_pipe = joblib.load('pipe.pkl')
    # The loaded pipeline accepts the raw, unencoded rows.
    print(loaded_pipe.score(X_test, y_test))
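
To make the difference concrete, here is a minimal sketch that contrasts the two situations on the same toy data. It assumes the setup in the question looked roughly like the first case, with the encoder fitted separately and only the forest saved; the names bare_rf, loaded_rf, raw_input_failed and same_as_fresh are hypothetical and used only for illustration. Temporary files are used so nothing is left on disk.

```python
import os
import tempfile

import joblib
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

X = [[1, 'orange', 'yes'], [1, 'apple', 'yes'],
     [-1, 'orange', 'no'], [-1, 'apple', 'no']]
y = [1, 1, 0, 0]

# Case 1 (assumed to mirror the question): the encoder lives outside the
# model, so only the forest is saved and the encoding step is lost.
encoder = OrdinalEncoder()
encoded = encoder.fit_transform([[row[1], row[2]] for row in X])
X_encoded = [[row[0], *enc] for row, enc in zip(X, encoded)]
bare_rf = RandomForestClassifier(n_estimators=2, random_state=42)
bare_rf.fit(X_encoded, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bare_rf.pkl')
    joblib.dump(bare_rf, path)
    loaded_rf = joblib.load(path)

try:
    loaded_rf.predict(X)  # raw rows still contain strings
    raw_input_failed = False
except ValueError as err:
    raw_input_failed = True
    print('bare model on raw data:', err)

# Case 2: encoder and forest are saved together as one pipeline.
pipe = Pipeline([
    ('encoder', make_column_transformer((OrdinalEncoder(), [1, 2]),
                                        remainder='passthrough')),
    ('rf', RandomForestClassifier(n_estimators=2, random_state=42)),
])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'pipe.pkl')
    joblib.dump(pipe, path)
    loaded_pipe = joblib.load(path)

# The loaded pipeline encodes the raw rows itself before predicting.
pipeline_predictions = loaded_pipe.predict(X)
same_as_fresh = list(pipeline_predictions) == list(pipe.predict(X))
print('loaded pipeline matches fresh pipeline:', same_as_fresh)
```

If the new data can contain categories the encoder never saw during fitting, OrdinalEncoder raises an error by default; in scikit-learn 0.24 and later, passing handle_unknown='use_encoded_value' together with unknown_value=-1 is one way to map them to a placeholder instead.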