Tags: python, machine-learning, scikit-learn, regression, non-linear-regression

Limitations of Regression in Machine Learning?


I've been learning some of the core concepts of ML lately and writing code using the sklearn (scikit-learn) library. After some basic practice, I tried my hand at the Airbnb NYC dataset from Kaggle (which has around 40,000 samples) - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png

I tried to make a model that could predict the price of a room/apt given the various features of the dataset. I realised that this was a regression problem and using this sklearn cheat-sheet, I started trying the various regression models.

I used sklearn.linear_model.Ridge as my baseline and, after doing some basic data cleaning, I got an abysmal R^2 score of 0.12 on my test set. Then I thought that maybe the linear model is too simplistic, so I tried the 'kernel trick' method adapted for regression (sklearn.kernel_ridge.KernelRidge), but it would take too much time to fit (>1hr)! To counter that, I used the sklearn.kernel_approximation.Nystroem function to approximate the kernel map, applied the transformation to the features prior to training, and then used a simple linear regression model. However, even that took a lot of time to transform and fit when I increased the n_components parameter, which I had to do to get any meaningful increase in accuracy.
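Roughly, the kernel-approximation pipeline I describe above looks like the sketch below. The CSV file name, the handful of numeric columns, and all hyperparameters are illustrative placeholders rather than my exact code:

```python
# Sketch of the Nystroem + linear model approach (file/column names and parameters are assumptions)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge

df = pd.read_csv("AB_NYC_2019.csv")  # the Kaggle CSV, name assumed
X = df[["latitude", "longitude", "minimum_nights", "number_of_reviews", "availability_365"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", n_components=300, random_state=0),  # approximate kernel map
    Ridge(alpha=1.0),                                          # cheap linear model on top
)
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```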

So I am wondering now: what happens when you want to do regression on a huge dataset? The kernel trick is extremely computationally expensive, while linear regression models are too simplistic, as real data is seldom linear. So are neural nets the only answer, or is there some clever solution that I am missing?

P.S. I am just starting out on Stack Overflow, so please let me know what I can do to make my question better!


Solution

  • This is a great question, but, as often happens, there is no simple answer to a complex problem. Regression is not as simple as it is often presented. It involves a number of assumptions and is not limited to linear least squares models. It takes a couple of university courses to fully understand it. Below I'll write a quick (and far from complete) memo about regressions:

    • Nothing will replace proper analysis. This might involve expert interviews to understand the limits of your dataset.
    • Your model (any model, not limited to regressions) is only as good as your features. If the home price depends on the local tax rate or school rating, even a perfect model would not perform well without these features.
    • Some features cannot be included in the model by design, so never expect a perfect score in the real world. For example, it is practically impossible to account for access to grocery stores, eateries, clubs, etc. Many of these features are also moving targets, as they tend to change over time. Even an R^2 of 0.12 might be great if human experts perform worse.
    • Models have their assumptions. Linear regression expects that the dependent variable (price) is linearly related to the independent ones (e.g. property size). By exploring residuals you can spot some non-linearities and cover them with non-linear features. However, some patterns are hard to spot, while still being addressable by other models, like non-parametric regressions and neural networks (see the sketch after this list).
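    To illustrate the last point, here is a minimal sketch on synthetic data, with k-nearest neighbours chosen purely as an example of a non-parametric regressor, showing how such a model can pick up a pattern a linear one misses:

```python
# Synthetic illustration: a non-parametric regressor vs. a linear model on a non-linear relation
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 500)).reshape(-1, 1)
y = 3 * np.sin(x).ravel() + rng.normal(0, 0.3, 500)   # clearly non-linear relation

print("linear R^2:", LinearRegression().fit(x, y).score(x, y))                    # poor
print("kNN    R^2:", KNeighborsRegressor(n_neighbors=15).fit(x, y).score(x, y))   # much better
```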

    So why do people still use (linear) regression?

    • it is the simplest and fastest model, which matters a lot for real-time systems and statistical analysis
    • often it is used as a baseline model. Before trying a fancy neural network architecture, it is helpful to know how much we improve compared to a naive method (see the sketch after this list)
    • sometimes regressions are used to test certain assumptions, e.g. the linearity of effects and relations between variables
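    A minimal sketch of the baseline idea, on a synthetic dataset; DummyRegressor is just one convenient choice of naive method:

```python
# Compare a naive "predict the mean" baseline against plain Ridge on synthetic data
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)  # always predicts the training mean
ridge = Ridge().fit(X_train, y_train)

print("dummy R^2:", baseline.score(X_test, y_test))  # around 0 by construction
print("ridge R^2:", ridge.score(X_test, y_test))     # any real model should beat the baseline
```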

    To summarize, regression is definitely not the ultimate tool in most cases, but it is usually the cheapest solution to try first.

    UPD: to illustrate the point about non-linearity.

    After building a regression you calculate the residuals, i.e. the regression error predicted_value - true_value. Then, for each feature, you make a scatter plot where the horizontal axis is the feature value and the vertical axis is the error value. Ideally, the residuals are normally distributed and do not depend on the feature value: errors are more often small than large, and they look similar across the whole plot.
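    For example, a minimal sketch of such a plot (synthetic data and a plain linear model, purely for illustration):

```python
# Fit a linear model to mildly non-linear data and plot residuals against the feature
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature = rng.uniform(0, 10, 1000)
target = 2 * feature + 0.3 * feature**2 + rng.normal(0, 1, 1000)

model = LinearRegression().fit(feature.reshape(-1, 1), target)
residuals = model.predict(feature.reshape(-1, 1)) - target   # predicted_value - true_value

plt.scatter(feature, residuals, s=5, alpha=0.3)  # repeat for every feature in a real dataset
plt.axhline(0, color="red")
plt.xlabel("feature value")
plt.ylabel("residual (predicted - true)")
plt.show()
```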

    This is how it should look:

    [figure: normal residuals]

    This is still normal - it only reflects the difference in density of your samples, but errors have the same distribution:

    [figure: these residuals are normal, too]

    This is an example of non-linearity (a periodic pattern; adding sin(x + b) as a feature should help):

    [figure: non-linear pattern in residuals]

    Another example of non-linearity (adding a squared feature should help):

    [figure: another non-linear pattern in residuals]
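    A quick synthetic illustration of the two fixes suggested in the captions above (here the phase b is assumed to be known; in practice you would have to estimate it):

```python
# Adding a periodic term and a squared term lets a linear model capture both patterns
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 1000)
y = 3 * x + 2 * np.sin(x + 0.7) + 0.4 * x**2 + rng.normal(0, 0.5, 1000)

X_plain = x.reshape(-1, 1)
X_fixed = np.column_stack([x, np.sin(x + 0.7), x**2])   # sin(x + b) and the squared feature

print("plain R^2:", LinearRegression().fit(X_plain, y).score(X_plain, y))
print("fixed R^2:", LinearRegression().fit(X_fixed, y).score(X_fixed, y))
```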

    The two examples above can be described as the residual mean differing depending on the feature value. Other problems include, but are not limited to, the following (a quick check for both is sketched after this list):

    • different variance depending on feature value
    • non-normal distribution of residuals (error is either +1 or -1, clusters, etc)
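    A minimal sketch of checking for these two problems, on synthetic residuals whose spread grows with the feature value so that the effect is visible:

```python
# Check residual spread across feature bins and run a quick normality test
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
feature = rng.uniform(1, 10, 2000)
residuals = rng.normal(0, 0.5 * feature)   # spread grows with the feature: heteroscedasticity

# different variance depending on the feature value: compare spread across quartile bins
edges = np.quantile(feature, [0, 0.25, 0.5, 0.75, 1.0])
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (feature >= lo) & (feature <= hi)
    print(f"feature in [{lo:.1f}, {hi:.1f}]: residual std = {residuals[mask].std():.2f}")

# non-normal distribution of residuals: a quick (far from exhaustive) normality test
stat, p_value = stats.normaltest(residuals)
print("normality test p-value:", p_value)
```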

    Some of the pictures above are taken from here:

    http://www.contrib.andrew.cmu.edu/~achoulde/94842/homework/regression_diagnostics.html

    This is a great read on regression diagnostics for beginners.