Tags: machine-learning, statistics, azure-machine-learning-service

Evaluating linear regression in Microsoft Azure Machine Learning


I'm playing with linear regression in Azure Machine Learning and evaluating a model.

I'm still a bit unsure what the various evaluation metrics mean and show, so I would appreciate some correction if I am wrong.

  1. Mean Absolute Error: Mean of the residuals (errors).
  2. Root Mean Squared Error: Std dev of the residuals. With this I can see how far from the mean/median my absolute error is.
  3. Relative absolute error: A percentage value that shows the percentage difference between the relative error and the absolute error. Lower values are better, indicating a lower difference.
  4. Relative squared error: Square of the error relative to the square of the absolute error. Unsure what this gives me over the relative absolute error.
  5. Coefficient of determination: Indication of correlation between inputs. +1 or -1 indicates perfect correlation, 0 indicates none.
  6. The histogram shows the frequency of various buckets of error magnitudes. It shows a lot of small errors, with frequency decreasing as the error value increases. Taken together with the poor metrics above, this indicates that there is probably some skew, or some outliers, having a large influence on the model and making it less accurate.

Are these definitions and assumptions correct?



Solution

  • You are almost correct on most points. To make sure we are talking in the same terms, a little bit of background:

    A linear regression uses data on some outcome variable y and independent variables x1, x2, .. and tries to find the linear combination of x1, x2, .. that best predicts y. Once this "best linear combination" is established, you can assess the quality of the fit (i.e. quality of the model) in multiple ways. The six points you mention are all key metrics for the quality of a regression equation.

    Running a regression gives you multiple "ingredients". For example, every observation gets a predicted value for the outcome variable. The difference between the observed value of y and the predicted value is called the residual or error. Residuals can be negative (if y is overestimated) or positive (if y is underestimated). The closer the residuals are to zero, the better. But what is "close"? The metrics you present are supposed to give insight into this.
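    As a small illustration (using made-up numbers, not your data), the residuals are simply observed minus predicted values:

    ```python
    # Hypothetical observed and predicted values from a fitted regression
    y_true = [3.0, 5.0, 2.5, 7.0, 4.5]
    y_pred = [2.8, 5.4, 2.9, 6.1, 4.6]

    # Residual (error) per observation: observed minus predicted.
    # Negative means y was overestimated, positive means underestimated.
    residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
    print(residuals)
    ```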

    • Mean absolute error: takes the absolute value of the residuals and takes the mean of that.
    • Root Mean Square Error: is the standard deviation of your residuals. This helps you see how large the spread of your residuals is. Because the residuals are squared, large residuals weigh more heavily than small ones. A low RMSE is good.
    • Relative Absolute Error: The absolute error as a fraction of the real value of the outcome variable y. In your case, the predictions are on average 75% higher or lower than the actual value of y.

    • Relative Squared Error: The squared error (residual^2) as a fraction of the real value.

    • Coefficient of Determination: Almost correct. This ranges between 0 and 1 and can be interpreted as the explanatory power of the independent variables in explaining y. In your case, the independent variables can explain 38.15% of the variation in y. Also, if you have only one independent variable, this coefficient is equal to the squared correlation coefficient.
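    A sketch of how these five metrics are commonly computed, using made-up numbers (note the assumption here: the relative metrics divide by deviations from the mean of y, the usual textbook definition; check your tool's documentation for its exact formulas):

    ```python
    import math

    def regression_metrics(y_true, y_pred):
        """Compute MAE, RMSE, RAE, RSE and R^2 for a set of predictions."""
        errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
        mean_y = sum(y_true) / len(y_true)
        devs = [yt - mean_y for yt in y_true]  # deviations from the mean of y

        mae = sum(abs(e) for e in errors) / len(errors)
        rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
        # Relative metrics: error relative to a naive "always predict the mean" model
        rae = sum(abs(e) for e in errors) / sum(abs(d) for d in devs)
        rse = sum(e * e for e in errors) / sum(d * d for d in devs)
        r2 = 1.0 - rse  # coefficient of determination

        return {"MAE": mae, "RMSE": rmse, "RAE": rae, "RSE": rse, "R2": r2}

    # Made-up example values, not the asker's data
    metrics = regression_metrics([3.0, 5.0, 2.5, 7.0, 4.5],
                                 [2.8, 5.4, 2.9, 6.1, 4.6])
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
    ```

    Note how R2 falls directly out of the relative squared error: a model no better than always predicting the mean of y has RSE = 1 and therefore R2 = 0.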

    Root Mean Squared Error and Coefficient of Determination are the most important metrics in nearly all situations. To be honest, I've rarely seen the other metrics reported.