
R: recode by a splitting rule


I have a student dataset containing the student ID, the question ID (5 questions), and the sequence number of each visit to a question. I would like to create a variable that marks exactly where a student starts reviewing questions after having seen all of them.

Here is a sample dataset:

data <- data.frame(
person =   c(1,1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
question = c(1,2,2,3,3,3,4,3,5,1,2, 1,1,1,2,3,4,4,4,5,5,4,3,4,4,5,4,5),
sequence = c(1,1,2,1,2,3,1,4,1,2,3, 1,2,3,1,1,1,2,3,1,2,4,2,5,6,3,7,4))

data
   person question sequence
1       1        1        1
2       1        2        1
3       1        2        2
4       1        3        1
5       1        3        2
6       1        3        3
7       1        4        1
8       1        3        4
9       1        5        1
10      1        1        2
11      1        2        3
12      2        1        1
13      2        1        2
14      2        1        3
15      2        2        1
16      2        3        1
17      2        4        1
18      2        4        2
19      2        4        3
20      2        5        1
21      2        5        2
22      2        4        4
23      2        3        2
24      2        4        5
25      2        4        6
26      2        5        3
27      2        4        7
28      2        5        4

The sequence variable numbers each visit to a question, so a student's first visit to a question has sequence 1, the second visit 2, and so on. Revisits can happen before the student has seen all questions. However, the attempt variable should switch to "review" only after the student has seen all 5 questions. With the new variable, this is the dataset I am aiming for:

> data
   person question sequence attempt
1       1        1        1 initial
2       1        2        1 initial
3       1        2        2 initial
4       1        3        1 initial
5       1        3        2 initial
6       1        3        3 initial
7       1        4        1 initial
8       1        3        4 initial
9       1        5        1 initial
10      1        1        2  review
11      1        2        3  review
12      2        1        1 initial
13      2        1        2 initial
14      2        1        3 initial
15      2        2        1 initial
16      2        3        1 initial
17      2        4        1 initial
18      2        4        2 initial
19      2        4        3 initial
20      2        5        1 initial
21      2        5        2 initial
22      2        4        4  review
23      2        3        2  review
24      2        4        5  review
25      2        4        6  review
26      2        5        3  review
27      2        4        7  review
28      2        5        4  review

Any ideas? Thanks!


Solution

  • What a challenging question. Took almost 2 hours to find the solution.

    Try this

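    library(dplyr)
    
    # Rolling count of distinct values: element i is the number of
    # unique values among the first i elements of var.
    dist_cum <- function(var)
      sapply(seq_along(var), function(x) length(unique(head(var, x))))
    
    data %>% 
      # var0: total number of distinct questions in the data (5 here)
      mutate(var0 = n_distinct(question)) %>%
      group_by(person) %>% 
      # var1: distinct questions seen so far by this person
      # var2: id of each run of consecutive rows on the same question
      mutate(var1 = dist_cum(question),
             var2 = cumsum(c(1, diff(question) != 0))) %>%
      ungroup() %>%
      # var3: 1 once all questions have been seen and this is not a first visit
      mutate(var3 = if_else(sequence == 1 | var1 < var0, 0, 1)) %>%
      group_by(person, var2) %>%
      # var4: a run that contains a first visit stays "initial" as a whole
      mutate(var4 = min(var3)) %>%
      ungroup() %>%
      mutate(attempt = if_else(var4 == 0, "initial", "review")) %>%
      select(-starts_with("var")) %>%
      as.data.frame()
    

    Result

       person question sequence attempt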
    1       1        1        1 initial
    2       1        2        1 initial
    3       1        2        2 initial
    4       1        3        1 initial
    5       1        3        2 initial
    6       1        3        3 initial
    7       1        4        1 initial
    8       1        3        4 initial
    9       1        5        1 initial
    10      1        1        2  review
    11      1        2        3  review
    12      2        1        1 initial
    13      2        1        2 initial
    14      2        1        3 initial
    15      2        2        1 initial
    16      2        3        1 initial
    17      2        4        1 initial
    18      2        4        2 initial
    19      2        4        3 initial
    20      2        5        1 initial
    21      2        5        2 initial
    22      2        4        4  review
    23      2        3        2  review
    24      2        4        5  review
    25      2        4        6  review
    26      2        5        3  review
    27      2        4        7  review
    28      2        5        4  review
    

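    dist_cum is a function that calculates a rolling count of distinct values (Source). var0...var4 are helper columns: var0 is the total number of distinct questions, var1 is the number of distinct questions the person has seen so far, var2 is an id for each run of consecutive rows on the same question, var3 flags rows that come after all questions have been seen and are not a first visit, and var4 spreads that flag over the whole run so that a run containing a first visit stays "initial".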
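
To see what the helpers produce, here is a small base R sketch on the sample data. It is not part of the answer above: it assumes `cumsum(!duplicated(x))` as a vectorised stand-in for dist_cum, and `label_attempt` is a hypothetical name introduced only for this illustration. It also assumes each person's rows are already in visit order.

```r
# Sample data from the question.
data <- data.frame(
  person   = c(1,1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  question = c(1,2,2,3,3,3,4,3,5,1,2, 1,1,1,2,3,4,4,4,5,5,4,3,4,4,5,4,5),
  sequence = c(1,1,2,1,2,3,1,4,1,2,3, 1,2,3,1,1,1,2,3,1,2,4,2,5,6,3,7,4))

# Rolling distinct count, copied from the answer.
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))

# Helper values for person 1.
q1 <- data$question[data$person == 1]
dist_cum(q1)                 # 1 2 2 3 3 3 4 4 5 5 5
cumsum(!duplicated(q1))      # same values, without the sapply loop
cumsum(c(1, diff(q1) != 0))  # run ids: 1 2 2 3 3 3 4 5 6 7 8

# Hypothetical helper: label one person's question vector.
label_attempt <- function(question, n_questions) {
  seen <- cumsum(!duplicated(question))       # distinct questions seen so far
  run  <- cumsum(c(1, diff(question) != 0))   # id of each same-question run
  done_at <- match(n_questions, seen)         # first row where all are seen
  # Rows in any run after the one containing done_at count as review;
  # a person who never sees every question stays "initial" throughout.
  ifelse(!is.na(done_at) & run > run[done_at], "review", "initial")
}

# Apply per person and put the labels back in the original row order.
n_questions <- length(unique(data$question))
data$attempt <- unsplit(
  lapply(split(data$question, data$person), label_attempt,
         n_questions = n_questions),
  data$person)

which(data$attempt == "review")  # 10 11 22 23 24 25 26 27 28
```

On this sample the labels match the dplyr result above. The sketch works from the question column alone, because once every question has been seen no later row can be a first visit.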