I have a large dataset of this format. I would like to a) identify those IDs/rows with the following sequence of values 1 - 1 - >1 - >1 anywhere between x1 and x10; and b) generate a new variable ("event") that identifies the beginning of the sequence, taking the value X1,...,X10.
my_df <- data.frame(ID = c("a","b","c","d","e","f","g","h"),
replicate(8,sample(1:4,8,rep=TRUE)))
For a), I have replaced values >1 with 2, then pasted the values from X1 to X10, and then filtered for the sequence 1 - 1 - 2 - 2. For b), I created the variable "event" using nested ifelse() to identify where the sequence begins. This works ok with only 8 columns. Is there a way to increase efficiency for datasets with more columns?
I highly appreciate any pointers!
df_seq <- my_df%>%
mutate_at(vars(starts_with('X')), funs(ifelse(. > 1, 2, .)))%>%
mutate(seq = paste(X1,"-",X2,"-",X3,"-",X4,"-",X5,"-",X6,"-",X7,"-",X8))%>%
filter(grepl("1 - 1 - 2 - 2", seq))%>%
mutate(event = ifelse(X1 == 1 & X2 == 1 & X3 == 2 & X4 == 2,"X1",
ifelse(X2 == 1 & X3 == 1 & X4 == 2 & X5 == 2,"X2",
ifelse(X3 == 1 & X4 == 1 & X5 == 2 & X6 == 2,"X3",
ifelse(X4 == 1 & X5 == 1 & X6 == 2 & X7 == 2,"X4","X5")))))
You can use regexpr
to capture the first occurrence of consecutive 1
s followed by values greater than 1
, for example
set.seed(0)
my_df %>%
mutate(event = regexpr("1{2}[^1]{2}", do.call(paste0, select(., -ID)))) %>%
mutate(event = ifelse(event > 0, paste0("X", event), NA))
gives
ID X1 X2 X3 X4 X5 X6 X7 X8 event
1 a 2 3 2 1 2 2 4 3 <NA>
2 b 1 2 2 1 1 2 1 4 <NA>
3 c 4 2 2 1 4 2 3 2 <NA>
4 d 3 3 2 2 3 3 2 2 <NA>
5 e 1 3 3 1 1 4 1 3 <NA>
6 f 2 1 1 1 4 4 4 3 X3
7 g 1 1 3 2 3 4 4 2 X1
8 h 3 1 1 2 2 2 1 2 X2