I have a student dataset including student information, question
id (5 questions), the sequence
of each trial to answer the questions. I would like to create a variable to distinguish where exactly student starts reviewing questions after finishing all questions.
Here is a sample dataset:
data <- data.frame(
person = c(1,1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
question = c(1,2,2,3,3,3,4,3,5,1,2, 1,1,1,2,3,4,4,4,5,5,4,3,4,4,5,4,5),
sequence = c(1,1,2,1,2,3,1,4,1,2,3, 1,2,3,1,1,1,2,3,1,2,4,2,5,6,3,7,4))
data
person question sequence
1 1 1 1
2 1 2 1
3 1 2 2
4 1 3 1
5 1 3 2
6 1 3 3
7 1 4 1
8 1 3 4
9 1 5 1
10 1 1 2
11 1 2 3
12 2 1 1
13 2 1 2
14 2 1 3
15 2 2 1
16 2 3 1
17 2 4 1
18 2 4 2
19 2 4 3
20 2 5 1
21 2 5 2
22 2 4 4
23 2 3 2
24 2 4 5
25 2 4 6
26 2 5 3
27 2 4 7
28 2 5 4
sequence
variables record each visit by giving a sequence number. Generally revisits could be before seeing all questions. However, the attempt
variable should only record after the student sees all 5 questions. With the new variable, I target this dataset.
> data
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 initial
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
Any ideas? Thanks!
What a challenging question. Took almost 2 hours to find the solution.
Try this
library(dplyr)
dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
data %>%
mutate(var0 = n_distinct(question)) %>%
group_by(person) %>%
mutate(var1 = dist_cum(question),
var2 = cumsum(c(1, diff(question) != 0))) %>%
ungroup() %>%
mutate(var3 = if_else(sequence == 1 | var1 < var0, 0, 1)) %>%
group_by(person, var2) %>%
mutate(var4 = min(var3)) %>%
ungroup() %>%
mutate(attemp = if_else(var4 == 0, "initial", "review")) %>%
select(-starts_with("var")) %>%
as.data.frame
Result
person question sequence attemp
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 initial
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
dist_cum
is a function to calculate rolling distinct (Source). var0
...var4
are helpers