machine-learning, regression, linear-regression, non-linear-regression

How to select regression algorithm for noisy (scattered) data?


I am going to do regression analysis with multiple variables. In my data I have n = 23 features and m = 13000 training examples. Here is the plot of my training data (area of houses against price):

[Image: scatter plot of the training data, house area against price]

There are 13000 training examples on the plot. As you can see, the data is relatively noisy. My question is which regression algorithm is more appropriate and reasonable to use in my case: is it more logical to use simple linear regression or some nonlinear regression algorithm?

To be clearer, here are some examples.
Here is an unrelated example of a linear regression fit:

[Image: example of a linear regression fit]

And an unrelated example of a nonlinear regression fit:

[Image: example of a nonlinear regression fit]

And here are some hypothetical regression lines for my data:

[Image: hypothetical regression lines drawn over the house price data]

AFAIK, plain linear regression on my data will produce a very high error cost because the data is very noisy and scattered. On the other hand, there is no apparent nonlinear pattern (for example, sinusoidal). Which regression algorithm would be more reasonable to use in my case (house price data) to get a more or less adequate house price prediction, and why is that algorithm (linear or nonlinear) more reasonable?


Solution

  • Using a nonlinear algorithm will reduce the error on your training set, because the curve can 'fit' your data more closely. However, it could lead to overfitting.

    To avoid this, a good approach is to plot the error (the cost function) on your training data and on your test data together. Adding more complexity to your model will keep reducing the error on your training data, but at some point it will start increasing the error on your test data.

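A minimal sketch of that train/test comparison, using NumPy only. The data is synthetic (a noisy linear relation standing in for area against price), and the polynomial degree is used as the "complexity" knob. The variable names, the 70/30 split and the degree range are all assumptions made for illustration; none of them come from the question's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the house data (an assumption, not the real dataset):
# price depends roughly linearly on standardized area, plus heavy noise.
area = rng.uniform(-1.0, 1.0, size=200)
price = 3.0 * area + 1.0 + rng.normal(0.0, 0.5, size=200)

# Hold out 30% of the examples as a test set.
idx = rng.permutation(len(area))
n_test = int(0.3 * len(area))
test_idx, train_idx = idx[:n_test], idx[n_test:]
x_train, y_train = area[train_idx], price[train_idx]
x_test, y_test = area[test_idx], price[test_idx]


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


# Increase model complexity (polynomial degree) and record both errors.
degrees = range(1, 9)
train_err, test_err = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, deg=d)  # least-squares fit on training data only
    train_err.append(mse(coeffs, x_train, y_train))
    test_err.append(mse(coeffs, x_test, y_test))

# Pick the degree with the lowest error on the held-out data.
best_degree = degrees[int(np.argmin(test_err))]

for d, tr, te in zip(degrees, train_err, test_err):
    print(f"degree={d}  train MSE={tr:.4f}  test MSE={te:.4f}")
print("degree with lowest test MSE:", best_degree)
```

The training error can only go down as the degree grows, so it cannot tell you when to stop. The test error is the curve to watch: plot both lists against `degrees` (for example with matplotlib) and choose the complexity where the test curve is lowest. With 23 features the same idea applies, with a multivariate model in place of `np.polyfit`.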