I'm working on a regression model and to evaluate the model performance, my boss thinks that we should use this metric:
Total Absolute Error Mean = mean(y_predicted) / mean(y_true) - 1
Where mean(y_predicted) is the average of all the predictions and mean(y_true) is the average of all the true values.
I have never seen this metric being used in machine learning before and I convinced him to add Mean Absolute Percentage Error as an alternative, yet even though my model is performing better regarding MAPE, some areas underperform when we look at Total Absolute Error Mean.
My gut feeling is that this metric is wrong in displaying the real accuracy, but I can't seem to understand why.
Is Total Absolute Error Mean a valid performance metric? If not, then why? If it is, why would a regression model's accuracy increase in terms of MAPE, but not in terms of Total Absolute Error Mean?
Thank you in advance!
I would kindly suggest to inform your boss that, when one wishes to introduce a new metric, it is on him/her to demonstrate why it is useful on top of the existing ones, not the other way around (i.e. us demonstrating why it is not); BTW, this is exactly the standard procedure when someone really comes up with a new proposed metric in a research paper, like the recent proposal of the Maximal Information Coefficient (MIC).
That said, it is not difficult to demonstrate in practice that this proposed metric is a poor one with some dummy data:
import numpy as np
from sklearn.metrics import mean_squared_error
# your proposed metric:
def taem(y_true, y_pred):
return np.mean(y_true)/np.mean(y_pred)-1
# dummy true data:
y_true = np.array([0,1,2,3,4,5,6])
Now, suppose that we have a really awesome model, which predicts perfectly, i.e. y_pred1 = y_true
; in this case both MSE and your proposed TAEM will indeed be 0:
y_pred1 = y_true # PERFECT predictions
mean_squared_error(y_true, y_pred1)
# 0.0
taem(y_true, y_pred1)
# 0.0
So far so good. But let's now consider the output of a really bad model, which predicts high values when it should have predicted low ones, and vice versa; in other words, consider a different set of predictions:
y_pred2 = np.array([6,5,4,3,2,1,0])
which is actually y_pred1
in reverse order. Now, it easy to see that here we will also have a perfect TAEM score:
taem(y_true, y_pred2)
# 0.0
while of course MSE would have warned us that we are very far indeed from perfect predictions:
mean_squared_error(y_true, y_pred2)
# 16.0
Bottom line: Any metric that ignores element-wise differences in favor of only averages suffers from similar limitations, namely taking identical values for any permutation of the predictions, a characteristic which is highly undesirable for a useful performance metric.