Tags: statistics, linear-regression, statsmodels, outliers

Alternatives to linear regression for a dataset with many small values and some extreme values


I want to model the pharmaceutical costs for a group of patients for the next year, based on this year's pharma data (codes for the pharmaceuticals), age, gender, and this year's costs.

I used linear regression and got an R^2 of 0.69, which was surprisingly good. When I divided the patients into five groups of equal size based on their costs for the current year, I could see that the bottom 80% performed extremely poorly, while the top 20% made up for it with a score of 0.71.

80% of the patients have costs roughly under 500 Euros, while those who have high costs have extreme costs, up to 500,000 Euros.

I think that, since linear regression minimises the sum of squared residuals, reducing the already small residuals of the low-cost patients brings less gain than reducing the huge residuals of the high-cost patients.

Is there an alternative model that would be more useful in this context and predict the small costs better as well?


Solution

  • This looks like a standard case of heteroscedasticity, where the variance increases with the expected mean.

    A few choices:

    • use WLS with weights that depend on the predicted value or on some of the predictors.
    • transform the dependent variable, e.g. model log(y), which amounts to estimating a log-normal model for y.
    • use a distribution whose variance increases with the mean:
      the Poisson has variance equal to the mean (use quasi-Poisson for continuous data), while the Gamma has variance quadratic in the mean.
      Those distributions are usually implemented as GLM families.

    Also, check that there is no nonlinear relationship between the explanatory and dependent variables.
    For example, generalized linear models use a link function to keep predictions in the domain of the dependent variable's distribution; non-negative values, for instance, can be modeled using an exponential mean function (log link).