r, statistics, linear-regression

Customize formula in geom_smooth / ggplot2 / R


I want to customize the formula used in geom_smooth like this:

library(MASS)
library(ggplot2)

data("Cars93", package = "MASS")

str(Cars93)

Cars93.log <- transform(Cars93, log.price = log(Price))

log.model <- lm(log.price ~ Horsepower*Origin, data = Cars93.log)
summary(log.model)
plot(log.model)

p <- ggplot(data = Cars93.log, aes(x = Horsepower, y = log.price, colour = Origin)) + 
  geom_point(aes(shape = Origin, color = Origin)) +   # points
  facet_grid(~ Origin) +
  theme(axis.title.x = element_text(margin=margin(15,0,0,0)),
        axis.title.y = element_text(margin=margin(0,15,0,0))) +
  scale_y_continuous(n.breaks = 7) +
  scale_colour_manual(values = c("USA" = "red","non-USA" = "black")) +
  scale_shape_manual(values = c(16,16)) +
  ylab("Price(log)")

lm.mod <- function(df) {
  y ~ x*Cars93.log$Origin
}

p_smooth <- by(Cars93.log, Cars93.log$Origin, 
               function(x) geom_smooth(data=x, method = lm, formula = lm.mod(x)))

p + p_smooth

However, I get an error saying that the computation failed because the variables have different lengths.

length(Cars93.log$log.price)
length(Cars93.log$Origin)
length(Cars93.log$Horsepower)

But when I check the length of each variable, they are all the same. Any ideas what's wrong?

Thanks a lot, Martina


Solution

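  • I agree with @Rui Barradas: the problem is in the lm.mod and p_smooth lines and the call to by(). The formula in lm.mod refers to Cars93.log$Origin, which always has all 93 rows, while x and y inside each geom_smooth layer come from the per-Origin subset that by() passes in, so the variable lengths no longer match.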
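
    The mismatch can be checked directly. This is a minimal sketch, assuming the same Cars93.log data frame as in the question; the name sizes is only for illustration:

    ```r
    library(MASS)

    data("Cars93", package = "MASS")
    Cars93.log <- transform(Cars93, log.price = log(Price))

    # Number of rows that by() hands to each geom_smooth layer (one subset per Origin)
    sizes <- sapply(split(Cars93.log, Cars93.log$Origin), nrow)
    sizes

    # Length of the vector that the formula in lm.mod actually refers to
    length(Cars93.log$Origin)

    # Each subset is shorter than the full column, which is what triggers the error
    all(sizes < length(Cars93.log$Origin))
    ```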

    Once you distinguish by Origin (for example with facet_wrap(~ Origin) or color = Origin), geom_smooth automatically fits a separate model for each facet or group.

    p <- ggplot(data = Cars93.log, 
                aes(x = Horsepower, y = log.price, color = Origin)) + 
      geom_point(aes(shape = Origin)) +
      facet_wrap(~ Origin) +
      theme(axis.title.x = element_text(margin=margin(15,0,0,0)),
            axis.title.y = element_text(margin=margin(0,15,0,0))) +
      scale_y_continuous(n.breaks = 7) +
      scale_colour_manual(values = c("USA" = "red","non-USA" = "black")) +
      scale_shape_manual(values = c(16,16)) +
      ylab("Price(log)")
    
    p + geom_smooth(method = lm, formula = y ~ x)
    

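    You can convince yourself that these are the same fitted lines as in log.model by extending the x-axis limits to see where each geom_smooth line would cross the y-axis (e.g., + coord_cartesian(xlim = c(0, 300))). Note that geom_smooth draws the line only over the range of the data unless fullrange = TRUE is set.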
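
    The same comparison can be made numerically rather than by eye. The sketch below assumes the log.model specification from the question and compares the intercept and slope it implies for each Origin with separate per-group fits, which is what geom_smooth computes here. The names b, from_interaction and per_group are illustrative:

    ```r
    library(MASS)

    data("Cars93", package = "MASS")
    Cars93.log <- transform(Cars93, log.price = log(Price))

    # Interaction model from the question
    log.model <- lm(log.price ~ Horsepower * Origin, data = Cars93.log)
    b <- unname(coef(log.model))  # (Intercept), Horsepower, Originnon-USA, interaction

    # Intercept and slope implied for each group ("USA" is the reference level)
    from_interaction <- rbind(
      "USA"     = c(b[1], b[2]),
      "non-USA" = c(b[1] + b[3], b[2] + b[4])
    )

    # Separate simple regressions, one per Origin
    per_group <- t(sapply(
      split(Cars93.log, Cars93.log$Origin),
      function(d) unname(coef(lm(log.price ~ Horsepower, data = d)))
    ))

    # TRUE if both approaches give the same lines (up to numerical tolerance)
    all.equal(unname(from_interaction), unname(per_group))
    ```

    The fitted lines agree, but the confidence bands need not: the interaction model pools the residual variance across both groups, whereas each per-group fit estimates its own.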

    You can also see the difference when color = Origin is not mapped for geom_smooth, which is what happens if you comment it out of the initial ggplot() call and drop the faceting. A single regression line is then fitted to all cars:

    p <- ggplot(data = Cars93.log, 
                aes(x = Horsepower, y = log.price)) + # color = Origin)) + 
      geom_point(aes(shape = Origin)) +
      #facet_wrap(~ Origin) +
      theme(axis.title.x = element_text(margin=margin(15,0,0,0)),
            axis.title.y = element_text(margin=margin(0,15,0,0))) +
      scale_y_continuous(n.breaks = 7) +
      scale_colour_manual(values = c("USA" = "red","non-USA" = "black")) +
      scale_shape_manual(values = c(16,16)) +
      ylab("Price(log)")
    
    p + geom_smooth(method = lm, formula = y ~ x)
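
    If the goal is a different model shape rather than just one line per group, the formula argument of geom_smooth can still be customized, as long as it is written in terms of the aesthetics x and y instead of the data frame columns. This is a minimal sketch, assuming a quadratic fit is wanted; the quadratic term and the name p_quad are illustrative and not part of the original question:

    ```r
    library(MASS)
    library(ggplot2)

    data("Cars93", package = "MASS")
    Cars93.log <- transform(Cars93, log.price = log(Price))

    # x and y in the formula refer to the mapped aesthetics (Horsepower, log.price);
    # the model is fitted separately within each panel
    p_quad <- ggplot(Cars93.log,
                     aes(x = Horsepower, y = log.price, colour = Origin)) +
      geom_point() +
      facet_wrap(~ Origin) +
      geom_smooth(method = "lm", formula = y ~ poly(x, 2))

    p_quad
    ```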