python pandas machine-learning scikit-learn predict

Linear regression prediction based on group of data in test set


I have a simple dataset which looks like this:

v1  v2  v3  hour_day  sales
3   4   24    12       133
5   5   13    12       243
4   9   3     3        93
5   12  5     3        101
4   9   3     6        93
5   12  5     6        101

I created a simple linear regression (LR) model to predict the target variable "sales", and I used the mean absolute error (MAE) to evaluate it:

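from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Define the input and target features (df is the DataFrame shown above)
X = df.iloc[:, [0, 1, 2, 3]]
y = df.iloc[:, 4]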

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


# Train and fit the model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Make prediction
y_pred = regressor.predict(X_test)

# Evaluate the model
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

My code works, but what I want is to predict the sales in X_test and report the MAE grouped by hour of the day. In the example dataset above there are three hour slots: 12, 3, and 6. So the output should look like this:

MAE for hour 12: 18.29
MAE for hour 3: 11.67
MAE for hour 6: 14.43

I think I should use a for loop to iterate. It could be something like this:

    # Save the hour vector
    hour_vec = X_test['hour_day'].copy()

    for i in range(len(X_test)):
        # Predict one row at a time; iloc[[i]] keeps the row as a 2-D DataFrame
        y_pred = regressor.predict(X_test.iloc[[i]])

Any idea how to do this?


Solution

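    import pandas as pd

    # One column per hour present in the test set
    hours = sorted(X_test['hour_day'].unique())
    results = pd.DataFrame(index=['MAE'], columns=hours, dtype=float)
    for hour in hours:
        # Boolean mask selecting the test rows for this hour
        idx = X_test['hour_day'] == hour
        y_pred_h = regressor.predict(X_test[idx])
        mae = metrics.mean_absolute_error(y_test[idx], y_pred_h)
        results.loc['MAE', hour] = mae
    # Unweighted mean of the per-hour MAEs
    results.loc['MAE', 'mean'] = results.loc['MAE'].mean()
    print(results)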
    

This prints:

                 3          6       mean
    MAE  71.405775  71.405775  71.405775
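
The same per-hour MAE can also be computed without an explicit loop, by predicting once for the whole test set and grouping the absolute errors by hour. The following is only a sketch: the `scored` and `mae_by_hour` names are made up for illustration, the DataFrame is rebuilt from the six sample rows in the question, and `test_size=0.5` is an assumption chosen so that the tiny sample leaves a few rows to group (the question uses 0.2).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Sample rows copied from the question.
df = pd.DataFrame({
    "v1": [3, 5, 4, 5, 4, 5],
    "v2": [4, 5, 9, 12, 9, 12],
    "v3": [24, 13, 3, 5, 3, 5],
    "hour_day": [12, 12, 3, 3, 6, 6],
    "sales": [133, 243, 93, 101, 93, 101],
})

X = df[["v1", "v2", "v3", "hour_day"]]
y = df["sales"]

# Assumption: test_size=0.5 instead of the question's 0.2, so that the
# six-row sample leaves three rows in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

regressor = LinearRegression().fit(X_train, y_train)

# Put the hour, the actual value and the prediction side by side.
scored = pd.DataFrame({
    "hour_day": X_test["hour_day"].to_numpy(),
    "actual": y_test.to_numpy(),
    "predicted": regressor.predict(X_test),
})

# Absolute error per row, then averaged within each hour: this is the MAE per hour.
scored["abs_error"] = (scored["actual"] - scored["predicted"]).abs()
mae_by_hour = scored.groupby("hour_day")["abs_error"].mean()

for hour, mae in mae_by_hour.items():
    print(f"MAE for hour {hour}: {mae:.2f}")
```

Only hours that actually occur in the test set show up in the result, so with a random split some hours may be missing. If every hour has to be represented, passing `stratify=X["hour_day"]` to `train_test_split` may help, provided each hour has enough rows.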