Search code examples
rdplyrlag

Lagging variable by group does not work in dplyr


I'm desperately trying to lag a variable by group. I found this post that deals with essentially the same problem I'm facing, but the solution does not work for me, no idea why.

This is my problem:

library(dplyr)

df <- data.frame(monthvec = c(rep(1:2, 2), rep(3:5, 3)))
df <- df %>%
       arrange(monthvec) %>%
       mutate(growth=ifelse(monthvec==1, 0.3,
                   ifelse(monthvec==2, 0.5,
                          ifelse(monthvec==3, 0.7,
                                 ifelse(monthvec==4, 0.1,
                                        ifelse(monthvec==5, 0.6,NA))))))

df%>%
   group_by(monthvec) %>%
   mutate(lag.growth = lag(growth, order_by=monthvec))

Source: local data frame [13 x 3]
Groups: monthvec [5]

monthvec growth lag.growth
  <int>  <dbl>      <dbl>
1         1    0.3         NA
2         1    0.3        0.3
3         2    0.5         NA
4         2    0.5        0.5
5         3    0.7         NA
6         3    0.7        0.7
7         3    0.7        0.7
8         4    0.1         NA
9         4    0.1        0.1
10        4    0.1        0.1
11        5    0.6         NA
12        5    0.6        0.6
13        5    0.6        0.6

This is what I'd like it to be in the end:

df$lag.growth <- c(NA, NA, 0.3, 0.3, 0.5, 0.5, 0.5, 0.7,0.7,0.7, 0.1,0.1,0.1)

   monthvec growth lag.growth
1         1    0.3         NA
2         1    0.3         NA
3         2    0.5        0.3
4         2    0.5        0.3
5         3    0.7        0.5
6         3    0.7        0.5
7         3    0.7        0.5
8         4    0.1        0.7
9         4    0.1        0.7
10        4    0.1        0.7 
11        5    0.6        0.1
12        5    0.6        0.1
13        5    0.6        0.1

I believe that one problem is that my groups are not of equal length...

Thanks for helping out.


Solution

  • Here is an idea. We group by monthvec in order to get the number of rows (cnt) of each group. We ungroup and use the first value of cnt as the size of the lag. We regroup on monthvec and replace the values in each group with the first value of each group.

    library(dplyr)
    
    df %>% 
     group_by(monthvec) %>% 
     mutate(cnt = n()) %>% 
     ungroup() %>% 
     mutate(lag.growth = lag(growth, first(cnt))) %>% 
     group_by(monthvec) %>% 
     mutate(lag.growth = first(lag.growth)) %>% 
     select(-cnt)
    

    which gives,

    # A tibble: 13 x 3
    # Groups:   monthvec [5]
       monthvec growth lag.growth
          <int>  <dbl>      <dbl>
     1        1    0.3         NA
     2        1    0.3         NA
     3        2    0.5        0.3
     4        2    0.5        0.3
     5        3    0.7        0.5
     6        3    0.7        0.5
     7        3    0.7        0.5
     8        4    0.1        0.7
     9        4    0.1        0.7
    10        4    0.1        0.7
    11        5    0.6        0.1
    12        5    0.6        0.1
    13        5    0.6        0.1