
Polynomial Regression Plot and 'newdata' Error


I am trying to better understand why stat_smooth() won't plot my polynomial regression line unless my x variable (the independent variable) is first assigned to an object outside of the plot (e.g. x <- dataset$Level).

Dataset

dataset <- tibble(Level = 1:10,
                  Salary = c(45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000))

Initial plot that returns error

Error: 'newdata' had 80 rows but variables found have 10 rows

ggplot(data = dataset, aes(x = Level, y = Salary)) +
  geom_point(color = "red") +
  stat_smooth(method = "lm", se = FALSE, formula = dataset$Salary ~
      poly(dataset$Level, 3)) +
  ggtitle("Truth or Bluff (Linear Regression)") +
  xlab("Level ") +
  ylab("Salary") +
  theme(plot.title = element_text(hjust = 0.5))

The solution that worked

x <- dataset$Level

ggplot(data = dataset, aes(x = Level, y = Salary)) +
  geom_point(color = "red") +
  stat_smooth(method = "lm", se = FALSE, formula = dataset$Salary ~ 
      poly(x, 3)) +
  ggtitle("Truth or Bluff (Linear Regression)") +
  xlab("Level ") +
  ylab("Salary") +
  theme(plot.title = element_text(hjust = 0.5))

From my understanding

x <- dataset$Level is no different from dataset$Level aside from being stored as a Value. My only thought is that it has to do with how poly() views x, a numeric vector, vs. how it views dataset$Level, an extracted vector.

Other than that I would expect the same result, but that is not the case.

I also tried renaming x to t, and it produces the same error as the first plot, so I don't understand why x is so significant if it's just the name of the Value.

t <- dataset$Level

ggplot(data = dataset, aes(x = Level, y = Salary)) +
  geom_point(color = "red") +
  stat_smooth(method = "lm", se = FALSE, formula = dataset$Salary ~ 
      poly(t, 3)) +
  ggtitle("Truth or Bluff (Linear Regression)") +
  xlab("Level ") +
  ylab("Salary") +
  theme(plot.title = element_text(hjust = 0.5))

Solution

  • The formula passed to stat_smooth is written in terms of the mapped aesthetics, i.e. x and y (you have mapped x = Level, y = Salary), not the original column names. Likewise, if you had mapped colour = SomeVariable, you'd have to use colour rather than SomeVariable.

    So use:

    stat_smooth(..., formula=y ~ poly(x, 3))
    

    The reason you are getting the warning

    In addition: Warning message:
    'newdata' had 80 rows but variables found have 10 rows 
    

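    is that your data frame dataset has 10 rows, but stat_smooth computes the fitted y values of the model at 80 x points in order to draw a smooth-looking line. Because the formula refers to dataset$Level directly, the prediction step cannot substitute those 80 points and finds the original 10-row vector instead, so the lengths don't match up.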

    The reason you don't get the error when you use poly(x, 3) in the formula is that this x resolves to the x column of the data frame ggplot constructs for the layer, rather than the global x you defined.

    Similarly, the reason you do get the error with poly(t, 3) is that t is not in ggplot's constructed data frame, so the next t on the search path is the global t.
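To see the mechanism outside ggplot2, here is a minimal sketch using base R's lm() and predict(). The 80-point grid and the names grid, fit_bad, fit_ok and warn_msg are illustrative assumptions. They stand in for what stat_smooth does when it predicts over a fine grid and are not ggplot2's actual internals. The last few lines build the corrected plot, but only if ggplot2 is installed.

```r
# Same data as in the question, as a base data.frame to avoid extra dependencies.
dataset <- data.frame(
  Level  = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

# An 80-point grid (assumed here to mimic the grid a smoother predicts over).
grid <- data.frame(Level = seq(1, 10, length.out = 80))

# Formula that hard-codes dataset$...: nothing in `grid` can replace
# dataset$Level, so predict() falls back to the original 10-row vector.
fit_bad <- lm(dataset$Salary ~ poly(dataset$Level, 3))

warn_msg <- NULL
pred_bad <- withCallingHandlers(
  predict(fit_bad, newdata = grid),
  warning = function(w) {
    warn_msg <<- conditionMessage(w)   # keep the message for inspection
    invokeRestart("muffleWarning")
  }
)
length(pred_bad)  # 10, not 80
warn_msg          # the 'newdata' message from the question

# Formula written with bare variable names: `Level` is looked up in newdata
# first, so the model is evaluated on all 80 grid points.
fit_ok  <- lm(Salary ~ poly(Level, 3), data = dataset)
pred_ok <- predict(fit_ok, newdata = grid)
length(pred_ok)   # 80

# The corrected plot: inside stat_smooth the variables are called x and y.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dataset, aes(x = Level, y = Salary)) +
    geom_point(color = "red") +
    stat_smooth(method = "lm", se = FALSE, formula = y ~ poly(x, 3))
  # print(p) draws the plot
}
```

The same lookup rule explains the x versus t behaviour: a name in the formula is searched for in the data supplied for prediction first, and only then in the environment where the formula was written.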