I am trying to calculate a cumulative sum for students that have been exposed to a support network during their schooling.
A sample dataframe would be (ID=student, Term=semester of interest, Support=Exposure to support network):
df <- data.frame(ID=c(123451, 123451, 123451, 123451, 123452, 123452, 123452, 123452,
123452, 123452, 123452, 123453, 123453, 123453, 123453, 123453, 123453, 123453, 123453),
Term= c(1141, 1148, 1158, 1141, 1158, 1161, 1148, 1151, 1158, 1138,
1141, 1138, 1141, 1141, 1148, 1138, 1148, 1158, 1161),
Support = c(1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1))
Since I am looking for cumulative exposure (starting from their earliest semester) I first ordered the data by ID and by Term:
df <- df[order(df[,1], df[,2]),]
I then calculated a cumulative sum for the Support variable into a separate column
df$Dosage <- ave(df[3], df[1], FUN=cumsum)
Output:
ID Term Support Dosage
1 123451 1141 1 1
4 123451 1141 1 2
2 123451 1148 0 2
3 123451 1158 1 3
10 123452 1138 0 0
11 123452 1141 0 0
7 123452 1148 1 1
8 123452 1151 1 2
5 123452 1158 1 3
9 123452 1158 1 4
6 123452 1161 0 4
12 123453 1138 1 1
16 123453 1138 0 1
13 123453 1141 0 1
14 123453 1141 1 2
15 123453 1148 0 2
17 123453 1148 1 3
18 123453 1158 1 4
19 123453 1161 1 5
While this is useful, in the event that a student has two rows for the same semester, I would like the value for Dosage to reflect the maximum value.
So for Student=123451 and Term=1141 I would like both Dosage values to equal 2.
For Student=123452 and Term=1158, I would like both Dosage values to equal 4.
For Student=123453 and Term=1148, I would like both Dosage values to equal 3.
Expected Output:
ID Term Support Dosage
1 123451 1141 1 2
4 123451 1141 1 2
2 123451 1148 0 2
3 123451 1158 1 3
10 123452 1138 0 0
11 123452 1141 0 0
7 123452 1148 1 1
8 123452 1151 1 2
5 123452 1158 1 4
9 123452 1158 1 4
6 123452 1161 0 4
12 123453 1138 1 1
16 123453 1138 0 1
13 123453 1141 0 2
14 123453 1141 1 2
15 123453 1148 0 3
17 123453 1148 1 3
18 123453 1158 1 4
19 123453 1161 1 5
Not quite understand what your problem is, but maybe you can try this:
library(dplyr)
df <- data.frame(ID=c(123451, 123451, 123451, 123451, 123452, 123452, 123452, 123452,
123452, 123452, 123452, 123453, 123453, 123453, 123453, 123453, 123453, 123453, 123453),
Term= c(1141, 1148, 1158, 1141, 1158, 1161, 1148, 1151, 1158, 1138,
1141, 1138, 1141, 1141, 1148, 1138, 1148, 1158, 1161),
Support = c(1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1))
df %>%
arrange(ID, Term) %>%
group_by(ID) %>%
mutate(Dosage = cumsum(Support)) %>%
ungroup() %>%
group_by(ID, Term) %>%
mutate(Dosage = max(Dosage)) %>%
ungroup()
ID Term Support Dosage
1 123451 1141 1 2
2 123451 1141 1 2
3 123451 1148 0 2
4 123451 1158 1 3
5 123452 1138 0 0
6 123452 1141 0 0
7 123452 1148 1 1
8 123452 1151 1 2
9 123452 1158 1 4
10 123452 1158 1 4
11 123452 1161 0 4
12 123453 1138 1 1
13 123453 1138 0 1
14 123453 1141 0 2
15 123453 1141 1 2
16 123453 1148 0 3
17 123453 1148 1 3
18 123453 1158 1 4
19 123453 1161 1 5