I want to model the pharmaceutical costs for a group of patients for the next year, based on this year's pharma data (codes for the pharmaceuticals), age, gender and this year's costs.
I used linear regression and got an R^2 of 0.69, which was surprisingly good. When I divided the patients into 5 groups of the same size based on their costs for the current year, I could see that the model performed extremely poorly on the bottom 80%, while the top 20% made up for it with a score of 0.71.
About 80% of the patients have costs under roughly 500 euros, while the high-cost patients have extreme costs, up to 500,000 euros.
I think that, since linear regression minimises the sum of squared residuals, reducing the already relatively small residuals of the low-cost patients brings much less gain than reducing the residuals of the high-cost patients.
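To make this reasoning concrete, here is a minimal sketch on simulated data (not the real patient data). The log-normal cost distribution, the error of up to 30 % on each prediction, and all variable names are assumptions chosen purely for illustration. It computes how much of the total squared error comes from the top 20% of patients when every prediction is off by a similar relative amount.

```python
import random

random.seed(0)

# Hypothetical skewed costs: log-normal, so most patients are cheap
# and a few are very expensive (an assumption, not the real data).
costs = sorted(random.lognormvariate(5.0, 2.0) for _ in range(10_000))

# Hypothetical predictions whose error is proportional to the true cost
# (each prediction is off by up to 30 % in either direction).
preds = [c * random.uniform(0.7, 1.3) for c in costs]

# Squared residual per patient, in the same (sorted) order as costs.
sq_err = [(c - p) ** 2 for c, p in zip(costs, preds)]

# Index separating the bottom 80 % from the top 20 % of patients.
cut = int(0.8 * len(costs))
share_top = sum(sq_err[cut:]) / sum(sq_err)

print(f"share of squared error from the top 20 %: {share_top:.4f}")
```

In this setup almost all of the squared error sits in the top group, so a least-squares fit has little incentive to get the low-cost patients right.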
Is there an alternative model that would be more useful in this context, so that small costs are predicted better as well?
This looks like a standard case of heteroscedasticity, where the variance increases with the expected value.
A few choices:
Also, check that there is no nonlinear relationship between the explanatory variables and the dependent variable.
For example, generalized linear models use a link function to keep the predictions within the domain of the dependent variable's distribution; non-negative values, for instance, can be modeled with an exponential mean function (log link).
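As a rough illustration of the log-link idea, here is a minimal sketch that fits a GLM with a log link to simulated data using only the Python standard library. The Gamma family is my assumption (its variance grows with the square of the mean, which fits the heteroscedasticity described above), and the helper names `ols_fit` and `fit_gamma_log_glm`, the single predictor, and the simulated data are all hypothetical. In practice you would more likely use a statistics package than hand-written fitting code.

```python
import math
import random


def ols_fit(x, z):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    slope = sxz / sxx
    return mean_z - slope * mean_x, slope


def fit_gamma_log_glm(x, y, n_iter=25):
    """Fit log(E[y]) = b0 + b1 * x by iteratively reweighted least squares.

    For the Gamma family with a log link the IRLS weights are constant,
    so each iteration is an unweighted regression of the working
    response on x. Requires strictly positive y.
    """
    # Starting values: least squares on log(y).
    b0, b1 = ols_fit(x, [math.log(yi) for yi in y])
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]   # linear predictor
        mu = [math.exp(e) for e in eta]    # mean via inverse log link
        # Working response: eta + (y - mu) * d(eta)/d(mu), with d(eta)/d(mu) = 1/mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        b0, b1 = ols_fit(x, z)
    return b0, b1


# Simulated data (hypothetical): the mean grows exponentially with x and
# the noise is multiplicative, so the variance grows with the mean.
random.seed(1)
true_b0, true_b1 = 3.0, 1.0
x = [random.uniform(0.0, 8.0) for _ in range(2000)]
y = [math.exp(true_b0 + true_b1 * xi) * random.gammavariate(2.0, 0.5) for xi in x]

b0, b1 = fit_gamma_log_glm(x, y)
preds = [math.exp(b0 + b1 * xi) for xi in x]  # always positive by construction

print(f"estimated intercept {b0:.3f}, slope {b1:.3f}")
```

Because the prediction is `exp(b0 + b1 * x)`, it can never be negative, and the fit weighs relative rather than absolute errors, so low-cost patients are not drowned out by the expensive ones. Patients with exactly zero cost would need separate handling, since this sketch assumes strictly positive costs.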