python, pandas, scikit-learn, regression

How can I link the records in the training dataset to the corresponding model predictions?


Using scikit-learn, I've set up a regression model to predict customers' maximum spend per transaction. The dataset I'm using looks a bit like this; the target column is maximum spend per transaction during the previous year:

customer_number | metric_1 | metric_2 | target
----------------|----------|----------|-------
111             | A        | X        | 15
222             | A        | Y        | 20
333             | B        | Y        | 30
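
(For reference, a tiny illustrative version of this data could be built as below; the real dataset is of course much larger.)

import pandas as pd

# Tiny illustrative version of the dataset shown above
dataset = pd.DataFrame({
    "customer_number": [111, 222, 333],
    "metric_1": ["A", "A", "B"],
    "metric_2": ["X", "Y", "Y"],
    "target": [15, 20, 30],
})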

I split the dataset into training & testing sets, one-hot encode the features, train the model, and make some test predictions:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Separate the target from the features, then split into train/test sets
target = pd.DataFrame(dataset, columns=["target"])
features = dataset.drop("target", axis=1)
train_features, test_features, train_target, test_target = train_test_split(features, target, test_size=0.25)

# One-hot encode the categorical features
train_features = pd.get_dummies(train_features)
test_features = pd.get_dummies(test_features)

model = RandomForestRegressor()
model.fit(X=train_features, y=train_target)

test_prediction = model.predict(X=test_features)

I can output various measures of the model's accuracy (mean absolute error, mean squared error, etc.) using the relevant functions in scikit-learn. However, I'd like to be able to tell which customers' predictions are the most inaccurate, so I want to be able to create a dataframe which looks like this:

customer_number | target | prediction | error
----------------|--------|----------- |------
111             | 15     | 17         | 2
222             | 20     | 19         | 1
333             | 30     | 50         | 20

I can use this to investigate if there is any correlation between the features and the model making inaccurate predictions. In this example, I can see that customer 333 has the biggest error by far, so I could potentially infer that customers with metric_1 = B end up with less accurate predictions.

I think I can calculate errors like this (please correct me if I'm wrong on this), but I don't know how to tie them back to customer number.

error = abs(test_target - test_prediction) 

How can I get the desired result?


Solution

  • The error you are computing is the absolute error. Averaged over the test set, it gives the mean absolute error (MAE), which is commonly used to evaluate regression models (there's a quick check of this after the code below). You can read about the choice of an error metric here.

    This error vector has the same length as your test dataset, and its elements are in the same order as your records, so a common approach is to assign the predictions and errors back into the test dataframe. If you keep customer_number in that dataframe, everything lines up.

    Starting with the DataFrame df and using idiomatic names for things:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    
    # Split the whole dataframe so df_test keeps customer_number alongside the features
    df_train, df_test = train_test_split(df)
    
    y_train, y_test = df_train["target"], df_test["target"]
    
    # Drop the identifier and the target before encoding; they are not model inputs
    X_train = df_train.drop(["customer_number", "target"], axis=1)
    X_test = df_test.drop(["customer_number", "target"], axis=1)
    
    # One-hot encode (this assumes both splits contain the same categories)
    X_train = pd.get_dummies(X_train)
    X_test = pd.get_dummies(X_test)
    
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    
    # Predictions and errors come back in the same row order as df_test,
    # so they line up with customer_number automatically
    df_test["prediction"] = model.predict(X_test)
    df_test["error"] = abs(df_test["target"] - df_test["prediction"])