r, ggplot2, linear-regression

Clarifying the aim of linear regression with multiple predictor variables and how to plot using ggplot2


I'm trying to learn the intricacies of linear regression for prediction, and I'd like to ask two questions:

  1. I've got one dependent variable (call it X) and, let's say, ten independent variables. I can use lm() to generate a model. But my question is this: is the aim of generating a model (or, more likely, multiple models) to identify the single best predictor of X, or is the aim to discover the best combination of predictors of X? I assumed the latter, but after several hours of reading online I am now unsure.

  2. If the aim is to discover the best combination of predictors of X, then (once I've identified that combination) how is a combination plotted properly? Plotting one line is easy, but for a combination would it be proper to (a) plot ten distinct regression lines (one per independent variable) or (b) plot a single line that somehow represents the combination? I've provided the summary() I'm working with in case it facilitates answering this question. [image: summary() output, not reproduced here]


Solution

  • Is the aim of generating a model (or, more likely, multiple models) to identify the single best predictor of X, or is the aim to discover the best combination of predictors of X?

    This depends mainly on your situation. If you will always have access to these predictors, then the aim is the latter: identify the best model, which will likely use a combination of them. Keep issues like overfitting in mind and make sure each predictor you include actually contributes something meaningful, but there's no reason not to include multiple predictors if they make your model meaningfully better.
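
    As a rough sketch of what "meaningfully better" can look like in practice, the snippet below fits three nested models and compares them. The data are simulated and the names (df, x1, x2, x3, y) are made up for illustration; they are not the asker's variables. Adjusted R^2, AIC, and a nested F-test are only some of the ways to compare models, and none of them replaces judgment about overfitting.

```r
# Simulated data: all names and values are illustrative, not from the question.
set.seed(42)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 * df$x1 + 1 * df$x2 + rnorm(n)      # x3 has no true effect on y

fit_one   <- lm(y ~ x1, data = df)            # single predictor
fit_two   <- lm(y ~ x1 + x2, data = df)       # combination of useful predictors
fit_three <- lm(y ~ x1 + x2 + x3, data = df)  # adds a pure-noise predictor

fits <- list(fit_one, fit_two, fit_three)

# Adjusted R^2 penalises extra predictors; lower AIC indicates a better trade-off.
comparison <- data.frame(
  model  = c("x1", "x1 + x2", "x1 + x2 + x3"),
  adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
  aic    = sapply(fits, AIC)
)
print(comparison)

# F-test for nested models: does adding x3 improve on x1 + x2?
print(anova(fit_two, fit_three))
```

    If the extra predictor barely moves adjusted R^2 and the F-test gives no evidence of improvement, that is one sign it is not earning its place in the model.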

    However, in many real-world scenarios predictors are not free. It might cost $10,000 to collect each predictor, and the organization you work for may only have the budget to collect one. In that case you might only be interested in the single best predictor, because collecting more than one going forward is not practical. You'd also be interested in how well that variable predicts in a simple regression, not a multiple regression, since you won't be controlling for the other variables in the future anyway (though the multiple regression results could still provide insight).
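
    If only one predictor can be collected, one simple way to look for the "single best" one is to fit a separate simple regression per candidate and rank them by R^2. This is a minimal sketch on simulated data; the column names and effect sizes are assumptions chosen for illustration.

```r
# Simulated data: names and effect sizes are illustrative only.
set.seed(1)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 * df$x1 + 0.5 * df$x2 + rnorm(n)

predictors <- c("x1", "x2", "x3")

# Fit y ~ <predictor> separately for each candidate and record its R^2.
single_r2 <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "y"), data = df)
  summary(fit)$r.squared
})

# Rank candidates from most to least variance explained on their own.
ranked <- sort(single_r2, decreasing = TRUE)
print(ranked)

best_single <- names(ranked)[1]
```

    Ranking on the same data used for fitting is optimistic; with enough observations, checking the ranking on held-out data would be a sensible next step.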

  • How is a combination plotted properly?

    Again, this depends on context. In most cases, though, you probably don't want to plot 10 regression lines: that's overwhelming to look at, and you will probably never have 10 variables that all contribute meaningfully to your model. I'm actually a bit surprised your adjusted R^2 is not lower given that quite a few of your coefficients are so close to zero, unless those predictors are just on massive scales.
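
    The point about scale can be illustrated with a small simulated example: a predictor measured in large units can have a raw coefficient that looks negligible while still mattering a lot. The variables below (income, score) are hypothetical and have nothing to do with the asker's data; standardising with scale() is one common way to put coefficients on a comparable footing.

```r
# Simulated data: 'income' and 'score' are hypothetical predictors.
set.seed(7)
n  <- 200
df <- data.frame(
  income = rnorm(n, mean = 50000, sd = 15000),  # predictor on a large scale
  score  = rnorm(n, mean = 0, sd = 1)           # predictor on a unit scale
)
df$y <- 0.0001 * df$income + 1 * df$score + rnorm(n)

# Raw fit: the income coefficient is tiny because income is in large units.
fit_raw <- lm(y ~ income + score, data = df)
print(coef(fit_raw))

# Standardise predictors (mean 0, sd 1) so coefficients are per standard deviation.
df_std <- df
df_std$income <- as.numeric(scale(df$income))
df_std$score  <- as.numeric(scale(df$score))

fit_std <- lm(y ~ income + score, data = df_std)
print(coef(fit_std))
```

    Rescaling changes only how the coefficients read, not how well the model fits: both fits have the same R^2.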

    First, who is viewing this graph? Is it you? If so, what information do you need to see that isn't already conveyed by the beta parameters? If it's someone else, who are they? If they are a stakeholder who knows nothing about statistics, you want a fairly simple graph that drives home your main point.

    Second, what is the purpose of your predictions, and how does the process you are predicting unfold in the real world? Say you're predicting how well people perform on the job from their scores on several selection measures. The first thing to consider is how that selection happens. Are candidates screened on their answers to some personality questions, with only the top scorers getting an interview? In that case, it might be useful to create multiple graphs that show each stage of that process. Alternatively, candidates might be reviewed holistically and assigned a sum score based on all these predictors. In that case one regression line makes sense, because you are interested in how the predictors act in concert.
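
    For the "one line" case, one common option (not the only one) is to plot observed values against the model's fitted values, so the x-axis is the combination of predictors itself. The sketch below uses ggplot2 on simulated data; the variable names and coefficients are assumptions for illustration.

```r
library(ggplot2)

# Simulated data: names and coefficients are illustrative only.
set.seed(123)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 1.5 * df$x1 - 1 * df$x2 + 0.5 * df$x3 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = df)

# The fitted value is the model's weighted combination of all predictors.
df$fitted <- fitted(fit)

# Observed vs fitted: for a least-squares fit with an intercept, regressing
# y on the fitted values gives intercept 0 and slope 1, so the identity line
# is the single line that represents the whole model.
p <- ggplot(df, aes(x = fitted, y = y)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, colour = "steelblue") +
  labs(
    x = "Fitted value (combination of x1, x2, x3)",
    y = "Observed y",
    title = "Observed vs fitted values"
  )
print(p)
```

    How tightly the points cluster around the line gives a non-technical viewer a direct picture of how well the predictors work together.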

    There is no single answer to this question, because it depends on why you're doing a regression in the first place. Once you identify why you're trying to predict this outcome and the context in which the process happens, you should be able to determine what makes the most sense. There is no "right" answer you'll find in a textbook, because most real-life problems are not in textbooks.