python dataframe scikit-learn pipeline

KeyError when passing a list of inputs to .predict() using a Pipeline


From my own experiments and from reading here on Stack Overflow, passing a pandas DataFrame to .predict() works and returns a prediction, as below:

pipe = Pipeline([('OneHotEncoder', encoder), ('RobustScaler', scaler),('RandomForestRegressor',RFregsr)])
pipe.fit(X_train, y_train)
with open('trained_RFregsr.pkl','wb') as f:
    pickle.dump(pipe, f)
test = pipe.predict(X[0:1])
print(test)

>> [10.82638889]

But when I pass a list containing all the required input values (25 in my case), it raises a KeyError. I think this is related to how iterating over a pandas DataFrame yields only the column names, not the values.

test = pipe.predict([['M', 15, 'U', 'LE3', 'T', 4, 3, 'teacher', 'services', 1, 3, 0,
        'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 5, 4, 4, 2, 15, 16]])
print(test)
>> KeyError : 'sex'
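
The behaviour described above can be checked in isolation. The following is only a minimal sketch: the "sex" column comes from the error message, while the "age" column and the sample values are made up for illustration. It shows that iterating a DataFrame yields column labels, and that a DataFrame built from a bare nested list has integer labels instead of names, so a lookup by 'sex' has nothing to match.

```python
import pandas as pd

# Hypothetical two-column frame; only "sex" appears in the original error.
frame = pd.DataFrame({"sex": ["M", "F"], "age": [15, 16]})

# Iterating a DataFrame yields its column labels, not its rows.
labels = list(frame)

# A nested list carries values only, so pandas assigns integer labels 0, 1, ...
unnamed = pd.DataFrame([["M", 15]])
has_sex = "sex" in unnamed.columns
```
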

I trained the model on 25 features, a mix of categorical and numerical values, to predict a single integer value. I am pickling the pipeline because I have to deploy it with FastAPI, where it will receive its input from API endpoints. If required, I can post the complete code somewhere. How can I safely pass a list of the required inputs so that the model can predict on them?

EDIT: This is how I have used the OneHotEncoder:

import category_encoders as ce
encoder = ce.OneHotEncoder()

x_train = encoder.fit_transform(X_train)

x_test = encoder.transform(X_test)


Solution

  • This looks like an error where encoder selects columns by name, as a ColumnTransformer does, and therefore expects a pandas DataFrame. pipe.predict is looking for a column named sex but cannot find one, because a plain list has no column names.

    For example, this:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.compose import ColumnTransformer
    import pandas as pd
    
    df = pd.DataFrame({
        "zero":  ["A", "B", "C", "A", "B", "C"],
        "one":   [1, 1, 2, 1, 2, 1],
        "two":   [0.5, 0.3, 0.2, 0.1, 0.9, 0.7],
        "label": [0, 0, 0, 1, 1, 1]})
    
    encoder = ColumnTransformer(
        [('ohe', OneHotEncoder(), ["zero", "one"])], remainder="passthrough")
    
    X, y = df.drop(["label"], axis=1), df["label"]
    
    pipe = Pipeline([('ohe', encoder), ('clf', RandomForestClassifier())])
    pipe.fit(X, y)
    pipe.predict([["A", 1, 0.5]])
    

    Results in (scikit-learn==1.2.0):

    ValueError: Specifying the columns using strings is only supported for pandas DataFrames
    

    But switching to a DataFrame with the same column names as the training data works:

    X_test = pd.DataFrame([["A", 1, 0.5]], columns=["zero", "one", "two"])
    print(pipe.predict(X_test))
    # [0]
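
For the deployment case in the question, where the values arrive as a plain list, one option is to keep the training column order next to the model and wrap every incoming list in a DataFrame before calling predict. This is a sketch built on the toy data above, not the asker's real pipeline: FEATURE_COLUMNS and rows_to_frame are hypothetical names, and random_state=0 is added only to make the run repeatable.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the example above.
df = pd.DataFrame({
    "zero":  ["A", "B", "C", "A", "B", "C"],
    "one":   [1, 1, 2, 1, 2, 1],
    "two":   [0.5, 0.3, 0.2, 0.1, 0.9, 0.7],
    "label": [0, 0, 0, 1, 1, 1]})
X, y = df.drop(["label"], axis=1), df["label"]

encoder = ColumnTransformer(
    [("ohe", OneHotEncoder(), ["zero", "one"])], remainder="passthrough")
pipe = Pipeline([("ohe", encoder),
                 ("clf", RandomForestClassifier(random_state=0))])
pipe.fit(X, y)

# Hypothetical constant: the training column order, stored alongside the model.
FEATURE_COLUMNS = list(X.columns)


def rows_to_frame(rows, columns=FEATURE_COLUMNS):
    """Wrap a list of value lists in a DataFrame with the training columns."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(
                f"expected {len(columns)} values per row, got {len(row)}")
    return pd.DataFrame(rows, columns=columns)


# The list that failed before is now labelled before it reaches the pipeline.
frame = rows_to_frame([["A", 1, 0.5]])
prediction = pipe.predict(frame)
```

The length check is a design choice: a list with a missing or extra value would otherwise produce a harder-to-read error, or silently shift values into the wrong columns. The values must still be given in the same order as the training columns.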