Search code examples
rdplyrgroup-byrowlag

How to group variables then subtract by rows in R?


I am trying to group the variables group, type, and year. Each group, type, and year have a particular code that changes from year to year. I want to create a column called "difference" where, if the group and type has a code of 200 in one year and 210 the next year, the "difference" column will register it as an increase in 10.

group <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
type <- c("small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large")
year <- c(1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995,
          1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996,
          1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997)
code <- c(100, 100, 100, 200, 200, 200, 300, 300, 300,
          150, 150, 100, 200, 200, 200, 350, 320, 300,
          130, 170, 90, 210, 90, 80, 310, 300, 320)

df <- data.frame(group, type, year, code)

This is how the df looks like:

 group   type year code
1      A  small 1995  100
2      A medium 1995  100
3      A  large 1995  100
4      B  small 1995  200
5      B medium 1995  200
6      B  large 1995  200
7      C  small 1995  300
8      C medium 1995  300
9      C  large 1995  300
10     A  small 1996  150
11     A medium 1996  150
12     A  large 1996  100
13     B  small 1996  200
14     B medium 1996  200
15     B  large 1996  200
16     C  small 1996  350
17     C medium 1996  320
18     C  large 1996  300
19     A  small 1997  130
20     A medium 1997  170
21     A  large 1997   90
22     B  small 1997  210
23     B medium 1997   90
24     B  large 1997   80
25     C  small 1997  310
26     C medium 1997  300
27     C  large 1997  320

I want the following output:

group <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
type <- c("small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large")
year <- c(1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995,
          1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996,
          1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997)
code <- c(100, 100, 100, 200, 200, 200, 300, 300, 300,
          150, 150, 100, 200, 200, 200, 350, 320, 300,
          130, 170, 90, 210, 90, 80, 310, 300, 320)
difference <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
               50, 50, 0, 0, 0, 0, 50, 20, 0,
               -20, 20, -10, 10, 110, 120, -40, -20, 0)

df2 <- data.frame(group, type, year, code, difference)

   group   type year code difference
1      A  small 1995  100         NA
2      A medium 1995  100         NA
3      A  large 1995  100         NA
4      B  small 1995  200         NA
5      B medium 1995  200         NA
6      B  large 1995  200         NA
7      C  small 1995  300         NA
8      C medium 1995  300         NA
9      C  large 1995  300         NA
10     A  small 1996  150         50
11     A medium 1996  150         50
12     A  large 1996  100          0
13     B  small 1996  200          0
14     B medium 1996  200          0
15     B  large 1996  200          0
16     C  small 1996  350         50
17     C medium 1996  320         20
18     C  large 1996  300          0
19     A  small 1997  130        -20
20     A medium 1997  170         20
21     A  large 1997   90        -10
22     B  small 1997  210         10
23     B medium 1997   90        110
24     B  large 1997   80        120
25     C  small 1997  310        -40
26     C medium 1997  300        -20
27     C  large 1997  320          0

This is what I tried:

df3 <- df2 %>%
  group_by(group, type, year) %>%
  mutate(difference = code - lag(code))

The problem is that the lag seems to not take the grouping into consideration and is instead just subtracting from the row right before it. Any suggestions?


Solution

  • Update on OP request: To get 0 we could use an ifelse statement:

    df %>% 
      group_by(group, type) %>% 
      mutate(difference= ifelse(is.na(lag(code)), 0, code - lag(code))) %>% 
      data.frame()
    
     group   type year code difference
    1      A  small 1995  100          0
    2      A medium 1995  100          0
    3      A  large 1995  100          0
    4      B  small 1995  200          0
    5      B medium 1995  200          0
    6      B  large 1995  200          0
    7      C  small 1995  300          0
    8      C medium 1995  300          0
    9      C  large 1995  300          0
    10     A  small 1996  150         50
    11     A medium 1996  150         50
    12     A  large 1996  100          0
    13     B  small 1996  200          0
    14     B medium 1996  200          0
    15     B  large 1996  200          0
    16     C  small 1996  350         50
    17     C medium 1996  320         20
    18     C  large 1996  300          0
    19     A  small 1997  130        -20
    20     A medium 1997  170         20
    21     A  large 1997   90        -10
    22     B  small 1997  210         10
    23     B medium 1997   90       -110
    24     B  large 1997   80       -120
    25     C  small 1997  310        -40
    26     C medium 1997  300        -20
    27     C  large 1997  320         20
    

    First(answer): As @IRTFM already points out. group only by group and type. It gives almost the same output. Note the last row is different.

    library(dplyr)
    
    df %>% 
      group_by(group, type) %>% 
      mutate(difference= code - lag(code)) %>% 
      data.frame()
    
     group   type year code difference
    1      A  small 1995  100         NA
    2      A medium 1995  100         NA
    3      A  large 1995  100         NA
    4      B  small 1995  200         NA
    5      B medium 1995  200         NA
    6      B  large 1995  200         NA
    7      C  small 1995  300         NA
    8      C medium 1995  300         NA
    9      C  large 1995  300         NA
    10     A  small 1996  150         50
    11     A medium 1996  150         50
    12     A  large 1996  100          0
    13     B  small 1996  200          0
    14     B medium 1996  200          0
    15     B  large 1996  200          0
    16     C  small 1996  350         50
    17     C medium 1996  320         20
    18     C  large 1996  300          0
    19     A  small 1997  130        -20
    20     A medium 1997  170         20
    21     A  large 1997   90        -10
    22     B  small 1997  210         10
    23     B medium 1997   90       -110
    24     B  large 1997   80       -120
    25     C  small 1997  310        -40
    26     C medium 1997  300        -20
    27     C  large 1997  320         20