Search code examples
rdplyrcaselag

R Lag Variable And Skip Value Between


DATA = data.frame(STUDENT = c(1,1,1,2,2,2,3,3,4,4),
                  SCORE = c(6,4,8,10,9,0,2,3,3,7),
                  CLASS = c('A', 'B', 'C', 'A', 'B', 'C', 'B', 'C', 'A', 'B'),
                  WANT = c(NA, NA, 2, NA, NA, -10, NA, NA, NA, NA))

I have DATA and wish to create 'WANT' which is calculate by:

For each STUDENT, find the SCORE where SCORE equals to SCORE(CLASS = C) - SCORE(CLASS = A)

EX: SCORE(STUDENT = 1, CLASS = C) - SCORE(STUDENT = 1, CLASS = A) = 8-6=2

Solution

  • Assuming at most one 'C' and 'A' CLASS per each 'STUDENT', just subset the 'SCORE' where the CLASS value is 'C', 'A', do the subtraction and assign the value only to position where CLASS is 'C' by making all other positions to NA (after grouping by 'STUDENT')

    library(dplyr)
    DATA <- DATA %>%
       group_by(STUDENT) %>% 
       mutate(WANT2 = (SCORE[CLASS == 'C'][1] - SCORE[CLASS == 'A'][1]) * 
           NA^(CLASS != "C")) %>%
       ungroup
    

    -output

    # A tibble: 10 × 5
       STUDENT SCORE CLASS  WANT WANT2
         <dbl> <dbl> <chr> <dbl> <dbl>
     1       1     6 A        NA    NA
     2       1     4 B        NA    NA
     3       1     8 C         2     2
     4       2    10 A        NA    NA
     5       2     9 B        NA    NA
     6       2     0 C       -10   -10
     7       3     2 B        NA    NA
     8       3     3 C        NA    NA
     9       4     3 A        NA    NA
    10       4     7 B        NA    NA