I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they pass away. I need to identify which participants have taken part for a consistent 'streak' since the start of the survey (i.e. if they stop, they stop for good).
I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and iterating over each row is very slow. Is there another approach that eliminates the need to loop over all the rows? I also thought about using apply but I cannot figure out how to get it to work.
In the example, TRUE means they participated in that year. The loop creates two vectors: finalyear for the last year they took part, and streak to show whether they completed all years before the finalyear (i.e. cases 1, 3 and 5).
dat <- data.frame(ids = 1:5, "1999" = c(T, T, T, F, T),
"2000" = c(T, F, T, F, T),
"2001" = c(T, T, T, T, T),
"2002" = c(F, T, T, T, T),
"2003" = c(F, T, T, T, F))
finalyear <- NULL
streak <- NULL
for (i in 1:nrow(dat)) {
x <- as.numeric(dat[i,2:6])
y <- max(grep(1, x))
finalyear[i] <- y
streak[i] <- sum(x) == y
}
dat$finalyear <- finalyear
dat$streak <- streak
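We could use max.col and rowSums as a vectorized approach. With ties.method 'last', max.col returns for each row the index of the last column that holds the row maximum, which for logical data is the last TRUE.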
# select the year columns explicitly so that columns added earlier (e.g. by the loop) are ignored
dat$finalyear <- max.col(dat[2:6], 'last')
If there are rows without TRUE values, we can make sure to return 0 for that row by multiplying with the double negation of rowSums. For such a row rowSums is 0, the double negation turns that into FALSE, and FALSE is coerced to 0 in the multiplication, so the result for that row is 0.
dat$finalyear <- max.col(dat[2:6], 'last') * !!rowSums(dat[2:6])
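To see why the multiplication matters, here is a minimal sketch on a small made-up matrix (the matrix m and the names has_any and last_col are illustrative assumptions, not part of the original data). In an all-FALSE row every column ties for the maximum, so max.col with 'last' reports the final column unless the result is zeroed out.

```r
# Hypothetical data: row 1 took part in years 1 and 2, row 2 never took part
m <- rbind(c(TRUE, TRUE, FALSE),
           c(FALSE, FALSE, FALSE))

# TRUE where the row has at least one TRUE; !! turns the counts into logicals
has_any <- !!rowSums(m)

# Index of the last column holding the row maximum;
# for the all-FALSE row all columns tie, so this is the last column
last_col <- max.col(m, "last")

# Multiplying coerces FALSE to 0, which zeroes the all-FALSE row
last_col * has_any
```

The last expression evaluates to 2 for the first row and 0 for the second.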
Then, we create the 'streak' column by comparing the rowSums of columns 2:6 with 'finalyear'. The two are equal only when every year up to and including the final one is TRUE.
dat$streak <- rowSums(dat[,2:6])==dat$finalyear
dat
#  ids X1999 X2000 X2001 X2002 X2003 finalyear streak
#1   1  TRUE  TRUE  TRUE FALSE FALSE         3   TRUE
#2   2  TRUE FALSE  TRUE  TRUE  TRUE         5  FALSE
#3   3  TRUE  TRUE  TRUE  TRUE  TRUE         5   TRUE
#4   4 FALSE FALSE  TRUE  TRUE  TRUE         5  FALSE
#5   5  TRUE  TRUE  TRUE  TRUE FALSE         4   TRUE
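As a sanity check, the following sketch recomputes both columns with a row-wise loop and with the vectorized calls on the same sample data, then compares them. The variable names (yrs, loop_final, vec_final and so on) are chosen here for illustration, and the loop uses which() rather than the grep() from the question, which should be equivalent for logical input.

```r
dat <- data.frame(ids = 1:5,
                  "1999" = c(TRUE, TRUE, TRUE, FALSE, TRUE),
                  "2000" = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                  "2001" = c(TRUE, TRUE, TRUE, TRUE, TRUE),
                  "2002" = c(FALSE, TRUE, TRUE, TRUE, TRUE),
                  "2003" = c(FALSE, TRUE, TRUE, TRUE, FALSE))

# Work on a logical matrix of the year columns only
yrs <- as.matrix(dat[2:6])

# Row-wise reference implementation (assumes every row has at least one TRUE)
loop_final  <- integer(nrow(yrs))
loop_streak <- logical(nrow(yrs))
for (i in seq_len(nrow(yrs))) {
  x <- yrs[i, ]
  loop_final[i]  <- max(which(x))          # position of the last TRUE
  loop_streak[i] <- sum(x) == loop_final[i] # no gaps before the last TRUE
}

# Vectorized version
vec_final  <- max.col(yrs, "last")
vec_streak <- rowSums(yrs) == vec_final

# Both comparisons should be TRUE
all(loop_final == vec_final)
all(loop_streak == vec_streak)
```

On data of realistic size, the two versions could be timed with system.time() to see how much the vectorized form helps.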
Alternatively, as suggested by @ColonelBeauvel, both columns can be created in a single mutate call (it would fit on one line, but is split over two here for readability):
library(dplyr)
# dat[2:6] restricts the calculation to the year columns
mutate(dat, finalyear = max.col(dat[2:6], 'last'),
       streak = rowSums(dat[2:6]) == finalyear)
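Note that finalyear above is a column position (1 to 5), not a calendar year. If the actual year is wanted, one possible sketch is to recover the years from the column names and index into them. The small data frame d and the column name finalyear_label are hypothetical; the "X" prefix is what data.frame adds by default to names that start with a digit.

```r
# Hypothetical three-year example
d <- data.frame(ids = 1:3,
                "1999" = c(TRUE, TRUE, FALSE),
                "2000" = c(TRUE, FALSE, TRUE),
                "2001" = c(FALSE, TRUE, TRUE))

# Column names are X1999, X2000, X2001; strip the prefix to get the years
years <- as.integer(sub("^X", "", names(d)[-1]))

# Position of the last TRUE per row (assumes no all-FALSE rows)
idx <- max.col(d[-1], "last")

# Translate positions into calendar years
d$finalyear_label <- years[idx]
d$finalyear_label
```

For this made-up data the labels are 2000, 2001 and 2001.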