Search code examples
rggplot2smoothing

smoothed grouped proportion plot


I have the following data set:

set.seed(10)
start_date <- as.Date('2000-01-01')  
end_date <- as.Date('2000-01-10')   


Data <- data.frame(
  id = rep((1:1000),10), 
  group = rep(c("A","B"), 25),
  x = sample(1:100),
  y = sample(c("1", "0"), 10, replace = TRUE),
  date = as.Date(
       sample(as.numeric(start_date):
              as.numeric(end_date), 1000,
              replace = T), origin = '2000-01-01'))

With that, I create the following plot:

Data %>% mutate(treated = factor(group)) %>%
  mutate(date = as.POSIXct(date)) %>% #convert date to date
  group_by(treated, date) %>% #group
  summarise(prop = sum(y=="1")/n()) %>% #calculate proportion 
  ggplot()+ theme_classic() + 
  geom_line(aes(x = date, y = prop, color = treated)) +
  geom_point(aes(x = date, y = prop, color = treated)) +
  geom_vline(xintercept = as.POSIXct("2000-01-05 12:00 GMT"), color = 'black', lwd = 1)

Unfortunately the plot is pretty 'jumpy' and I would like to smooth it. I tried geom_smooth() but can't get it to work. Other questions regarding smoothing didn't help me because they missed the grouping aspect and therefore had a different structure. However, the example data set is in reality part of a larger data set so I need to stick to that code.

[Edit: the geom_smooth() code I tried is geom_smooth(method = 'auto', formula = y ~ x)]

Can someone point me into the right direction? Many thanks and all the best.


Solution

  • Is this what you want by a smoothed line? You call geom_smooth with aesthetics, not in combination with geom_line. You can choose different smoothing methods, though the default loess with low observations is usually what people want. As an aside, I don't think this is necessarily nicer to look at than the geom_line version, and in fact is slightly less readable. geom_smooth is best used when there are many y observations for every x which makes patterns hard to see, geom_line is good for 1-1.

    EDIT: After looking at what you're doing more closely, I added a second plot that doesn't directly calculate the treatment-date means and just uses geom_smooth directly. That lets you get a more reasonable confidence interval instead of having to remove it as before.

    set.seed(10)
    start_date <- as.Date('2000-01-01')  
    end_date <- as.Date('2000-01-10')   
    
    
    Data <- data.frame(
      id = rep((1:1000),10), 
      group = rep(c("A","B"), 25),
      x = sample(1:100),
      y = sample(c("1", "0"), 10, replace = TRUE),
      date = as.Date(
        sample(as.numeric(start_date):
                 as.numeric(end_date), 1000,
               replace = T), origin = '2000-01-01'))
    
    library(tidyverse)
    Data %>%
      mutate(treated = factor(group)) %>%
      mutate(date = as.POSIXct(date)) %>% #convert date to date
      group_by(treated, date) %>% #group
      summarise(prop = sum(y=="1")/n()) %>% #calculate proportion 
      ggplot() +
      theme_classic() + 
      geom_smooth(aes(x = date, y = prop, color = treated), se = F) +
      geom_point(aes(x = date, y = prop, color = treated)) +
      geom_vline(xintercept = as.POSIXct("2000-01-05 12:00 GMT"), color = 'black', lwd = 1)
    #> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
    

    Data %>%
      mutate(treated = factor(group)) %>%
      mutate(y = ifelse(y == "0", 0, 1)) %>% 
      mutate(date = as.POSIXct(date)) %>% #convert date to date
      ggplot() +
      theme_classic() +
      geom_smooth(aes(x = date, y = y, color = treated), method = "loess") +
      geom_vline(xintercept = as.POSIXct("2000-01-05 12:00 GMT"), color = 'black', lwd = 1)
    

    Created on 2018-03-27 by the reprex package (v0.2.0).