Tags: r, for-loop, vectorization, apply

Vectorized solution to find uninterrupted attendance streaks


I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they pass away. I need to identify which participants have taken part in an unbroken 'streak' since the start of the survey (i.e. once they stop, they stop for good).

I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and iterating over each row is very slow. Is there another approach that I could use that eliminates the need for looping over all the rows? I also thought about using apply but I cannot figure out how to get it to work.

In the example, TRUE means they participated in that year. The loop creates two vectors - finalyear for the last year they took part, and streak to show if they completed all years before the finalyear (i.e. cases 1, 3 and 5).

dat <- data.frame(ids = 1:5, "1999" = c(T, T, T, F, T), 
                             "2000" = c(T, F, T, F, T), 
                             "2001" = c(T, T, T, T, T), 
                             "2002" = c(F, T, T, T, T), 
                             "2003" = c(F, T, T, T, F))
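# Preallocate the result vectors instead of growing them from NULL.
finalyear <- integer(nrow(dat))
streak <- logical(nrow(dat))

for (i in seq_len(nrow(dat))) {

    x <- as.numeric(dat[i, 2:6])   # 1 = took part, 0 = did not
    y <- max(which(x == 1))        # position of the last year attended

    finalyear[i] <- y
    streak[i] <- sum(x) == y       # TRUE if every year up to y was attended

}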

dat$finalyear <- finalyear
dat$streak <- streak

Solution

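  • max.col and rowSums give a vectorized approach. With ties.method = 'last', max.col returns the index of the last column holding each row's maximum, which for logical data is the last TRUE.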

    dat$finalyear <- max.col(dat[2:6], 'last')  # year columns only
    

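    If a row has no TRUE values, every column ties at the maximum and max.col returns the last column index. To return 0 for such rows instead, multiply by the double negation of rowSums: !!rowSums(...) is FALSE for an all-FALSE row, FALSE is coerced to 0, and multiplying by 0 gives 0 for that row. The year columns are selected explicitly as 2:6 so that the line still works if 'finalyear' has already been added to dat.

    dat$finalyear <- max.col(dat[2:6], 'last') * !!rowSums(dat[2:6])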
    

    Then we create the 'streak' column by comparing the rowSums of columns 2:6 (the number of years attended) with 'finalyear'. The two are equal only when every year up to the final one was attended.

    dat$streak <- rowSums(dat[, 2:6]) == dat$finalyear
    dat
    #   ids X1999 X2000 X2001 X2002 X2003 finalyear streak
    #1   1  TRUE  TRUE  TRUE FALSE FALSE         3   TRUE
    #2   2  TRUE FALSE  TRUE  TRUE  TRUE         5  FALSE
    #3   3  TRUE  TRUE  TRUE  TRUE  TRUE         5   TRUE
    #4   4 FALSE FALSE  TRUE  TRUE  TRUE         5  FALSE
    #5   5  TRUE  TRUE  TRUE  TRUE FALSE         4   TRUE
    

    Or, as suggested by @ColonelBeauvel, in a single mutate call (it would fit on one line but is split over two for readability):

    library(dplyr)
    mutate(dat, finalyear = max.col(dat[2:6], 'last'),
                streak = rowSums(dat[2:6]) == finalyear)
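
The all-FALSE guard and the streak test described above can be checked on a small example. This is only a sketch: participant 6, who never took part, is a hypothetical row added to the question's data, and the names years, raw and streak_check are illustrative. The row-wise apply version is included purely as a cross-check, not as a recommended approach.

```r
# Data from the question plus a hypothetical participant 6 who never took part.
dat <- data.frame(ids = 1:6,
                  "1999" = c(TRUE, TRUE,  TRUE, FALSE, TRUE,  FALSE),
                  "2000" = c(TRUE, FALSE, TRUE, FALSE, TRUE,  FALSE),
                  "2001" = c(TRUE, TRUE,  TRUE, TRUE,  TRUE,  FALSE),
                  "2002" = c(FALSE, TRUE, TRUE, TRUE,  TRUE,  FALSE),
                  "2003" = c(FALSE, TRUE, TRUE, TRUE,  FALSE, FALSE))

# Work on the year columns only, as a logical matrix.
years <- as.matrix(dat[2:6])

# Without the guard, an all-FALSE row ties in every column,
# so 'last' reports the final column for participant 6.
raw <- max.col(years, "last")

# !!rowSums(years) is FALSE for rows with no TRUE, which zeroes them out.
finalyear <- raw * !!rowSums(years)

# Streak: the number of years attended equals the index of the last one.
streak <- rowSums(years) == finalyear

# Row-wise cross-check: cumprod() drops to 0 at the first FALSE, so its sum
# is the length of the initial run of TRUEs; a streak has no TRUE outside it.
streak_check <- apply(years, 1, function(x) sum(cumprod(x)) == sum(x))
stopifnot(identical(unname(streak), unname(streak_check)))

cbind(dat, finalyear, streak)
```

Note that the hypothetical participant 6 gets finalyear 0 and streak TRUE, because 0 years attended equals a final year of 0. If such rows should not count as a streak, something like streak & finalyear > 0 would exclude them.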