Tags: math, machine-learning, statistics, weka, linear-regression

Understanding regression results


I have a set of numerical features that describe a phenomenon at different time points. To evaluate the individual performance of each feature, I perform a linear regression with leave-one-out validation, and I compute correlations and errors to evaluate the results.

So for a single feature, it would be something like:

  • Input: Feature F = {F_t1, F_t2, ..., F_tn}
  • Input: Phenomenon P = {P_t1, P_t2, ..., P_tn}
  • Linear regression of P on F, with leave-one-out cross-validation.
  • Evaluation: compute correlations (Pearson and Spearman) and errors (mean absolute and root mean squared)
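The procedure above can be sketched as follows. This is an illustrative re-implementation in Python (the question actually uses Weka); the function name and return structure are my own, and scikit-learn and scipy are assumed to be available.

```python
# Hedged sketch of the leave-one-out evaluation described above.
# Not the asker's actual code (which uses Weka); names are illustrative.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

def evaluate_feature(F, P):
    """Leave-one-out linear regression of phenomenon P on a single feature F."""
    F = np.asarray(F, dtype=float).reshape(-1, 1)  # one feature, n time points
    P = np.asarray(P, dtype=float)
    preds = np.empty_like(P)
    # Fit on n-1 points, predict the held-out point, for every point in turn.
    for train_idx, test_idx in LeaveOneOut().split(F):
        model = LinearRegression().fit(F[train_idx], P[train_idx])
        preds[test_idx] = model.predict(F[test_idx])
    return {
        "pearson": pearsonr(preds, P)[0],
        "spearman": spearmanr(preds, P)[0],
        "mae": np.mean(np.abs(preds - P)),
        "rmse": np.sqrt(np.mean((preds - P) ** 2)),
    }
```

On a perfectly linear feature this yields correlations of 1 and near-zero errors; the question is about features where the correlations stay high but the errors do not.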

For some of the variables, both correlations are really good (> 0.9), but when I look at the predictions, I see that they are all very close to the average of the values to predict, so the errors are large.

How is that possible?

Is there a way to fix it?

One technical detail: I use the Weka linear regression with the option "-S 1", which disables attribute selection.


Solution

  • This can happen because the relationship being regressed is not linear while the model is. Correlation only measures how well the predictions track the targets up to a linear rescaling: predictions that preserve the ordering of the targets but are shrunk toward their mean can correlate almost perfectly while still having large absolute errors. So good correlations with poor errors do not necessarily mean the regression is wrong or really poor, but you have to be careful and investigate further.

    If the relationship is indeed non-linear, a non-linear approach that minimizes the errors while keeping the correlation high is the way to go.

    Outliers can also produce this behaviour, since a few extreme points can dominate the fit and pull the remaining predictions toward the mean.
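The "high correlation, large errors" effect can be reproduced with a toy example. The numbers below are synthetic (not from the original data): the hypothetical predictions preserve the ordering and spacing of the targets but are shrunk 95% toward the target mean, so the Pearson correlation is essentially 1 while the mean absolute error remains large.

```python
# Toy illustration with synthetic numbers: predictions that hug the
# target mean can still correlate perfectly with the targets.
import numpy as np
from scipy.stats import pearsonr

P = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # values to predict
mean = P.mean()
preds = mean + 0.05 * (P - mean)              # shrunk 95% toward the mean

r, _ = pearsonr(preds, P)
mae = np.mean(np.abs(preds - P))
print(r)    # ~1.0: preds are an exact linear function of P
print(mae)  # 11.4: yet every prediction is far from its target
```

Correlation is invariant to the scale factor 0.05, which is why it cannot detect this kind of compression; the absolute and squared errors can.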