I am trying to group the variables group, type, and year. Each group, type, and year have a particular code that changes from year to year. I want to create a column called "difference" where, if the group and type has a code of 200 in one year and 210 the next year, the "difference" column will register it as an increase in 10.
group <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
type <- c("small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large")
year <- c(1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995,
1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996,
1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997)
code <- c(100, 100, 100, 200, 200, 200, 300, 300, 300,
150, 150, 100, 200, 200, 200, 350, 320, 300,
130, 170, 90, 210, 90, 80, 310, 300, 320)
df <- data.frame(group, type, year, code)
This is how the df looks like:
group type year code
1 A small 1995 100
2 A medium 1995 100
3 A large 1995 100
4 B small 1995 200
5 B medium 1995 200
6 B large 1995 200
7 C small 1995 300
8 C medium 1995 300
9 C large 1995 300
10 A small 1996 150
11 A medium 1996 150
12 A large 1996 100
13 B small 1996 200
14 B medium 1996 200
15 B large 1996 200
16 C small 1996 350
17 C medium 1996 320
18 C large 1996 300
19 A small 1997 130
20 A medium 1997 170
21 A large 1997 90
22 B small 1997 210
23 B medium 1997 90
24 B large 1997 80
25 C small 1997 310
26 C medium 1997 300
27 C large 1997 320
I want the following output:
group <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
type <- c("small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large", "small", "medium", "large")
year <- c(1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995,
1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996,
1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997)
code <- c(100, 100, 100, 200, 200, 200, 300, 300, 300,
150, 150, 100, 200, 200, 200, 350, 320, 300,
130, 170, 90, 210, 90, 80, 310, 300, 320)
difference <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
50, 50, 0, 0, 0, 0, 50, 20, 0,
-20, 20, -10, 10, 110, 120, -40, -20, 0)
df2 <- data.frame(group, type, year, code, difference)
group type year code difference
1 A small 1995 100 NA
2 A medium 1995 100 NA
3 A large 1995 100 NA
4 B small 1995 200 NA
5 B medium 1995 200 NA
6 B large 1995 200 NA
7 C small 1995 300 NA
8 C medium 1995 300 NA
9 C large 1995 300 NA
10 A small 1996 150 50
11 A medium 1996 150 50
12 A large 1996 100 0
13 B small 1996 200 0
14 B medium 1996 200 0
15 B large 1996 200 0
16 C small 1996 350 50
17 C medium 1996 320 20
18 C large 1996 300 0
19 A small 1997 130 -20
20 A medium 1997 170 20
21 A large 1997 90 -10
22 B small 1997 210 10
23 B medium 1997 90 110
24 B large 1997 80 120
25 C small 1997 310 -40
26 C medium 1997 300 -20
27 C large 1997 320 0
This is what I tried:
df3 <- df2 %>%
group_by(group, type, year) %>%
mutate(difference = code - lag(code))
The problem is that the lag seems to not take the grouping into consideration and is instead just subtracting from the row right before it. Any suggestions?
Update on OP request:
To get 0
we could use an ifelse
statement:
df %>%
group_by(group, type) %>%
mutate(difference= ifelse(is.na(lag(code)), 0, code - lag(code))) %>%
data.frame()
group type year code difference
1 A small 1995 100 0
2 A medium 1995 100 0
3 A large 1995 100 0
4 B small 1995 200 0
5 B medium 1995 200 0
6 B large 1995 200 0
7 C small 1995 300 0
8 C medium 1995 300 0
9 C large 1995 300 0
10 A small 1996 150 50
11 A medium 1996 150 50
12 A large 1996 100 0
13 B small 1996 200 0
14 B medium 1996 200 0
15 B large 1996 200 0
16 C small 1996 350 50
17 C medium 1996 320 20
18 C large 1996 300 0
19 A small 1997 130 -20
20 A medium 1997 170 20
21 A large 1997 90 -10
22 B small 1997 210 10
23 B medium 1997 90 -110
24 B large 1997 80 -120
25 C small 1997 310 -40
26 C medium 1997 300 -20
27 C large 1997 320 20
First(answer):
As @IRTFM already points out. group only by group
and type
.
It gives almost the same output. Note the last row is different.
library(dplyr)
df %>%
group_by(group, type) %>%
mutate(difference= code - lag(code)) %>%
data.frame()
group type year code difference
1 A small 1995 100 NA
2 A medium 1995 100 NA
3 A large 1995 100 NA
4 B small 1995 200 NA
5 B medium 1995 200 NA
6 B large 1995 200 NA
7 C small 1995 300 NA
8 C medium 1995 300 NA
9 C large 1995 300 NA
10 A small 1996 150 50
11 A medium 1996 150 50
12 A large 1996 100 0
13 B small 1996 200 0
14 B medium 1996 200 0
15 B large 1996 200 0
16 C small 1996 350 50
17 C medium 1996 320 20
18 C large 1996 300 0
19 A small 1997 130 -20
20 A medium 1997 170 20
21 A large 1997 90 -10
22 B small 1997 210 10
23 B medium 1997 90 -110
24 B large 1997 80 -120
25 C small 1997 310 -40
26 C medium 1997 300 -20
27 C large 1997 320 20