Search code examples
rdataframedplyr

Subsetting rows with sequence of values and identifying columns where sequence begins


I have a large dataset of this format. I would like to a) identify those IDs/rows with the following sequence of values 1 - 1 - >1 - >1 anywhere between x1 and x10; and b) generate a new variable ("event") that identifies the beginning of the sequence, taking the value X1,...,X10.

my_df <- data.frame(ID = c("a","b","c","d","e","f","g","h"),
                    replicate(8,sample(1:4,8,rep=TRUE)))

For a), I have replaced values >1 with 2, then pasted the values from X1 to X10, and then filtered for the sequence 1 - 1 - 2 - 2. For b), I created the variable "event" using nested ifelse() to identify where the sequence begins. This works ok with only 8 columns. Is there a way to increase efficiency for datasets with more columns?

I highly appreciate any pointers!

df_seq <- my_df%>%
  mutate_at(vars(starts_with('X')), funs(ifelse(. > 1, 2, .)))%>%
  mutate(seq = paste(X1,"-",X2,"-",X3,"-",X4,"-",X5,"-",X6,"-",X7,"-",X8))%>%
  filter(grepl("1 - 1 - 2 - 2", seq))%>%
  mutate(event = ifelse(X1 == 1 & X2 == 1 & X3 == 2 & X4 == 2,"X1",
                 ifelse(X2 == 1 & X3 == 1 & X4 == 2 & X5 == 2,"X2",
                 ifelse(X3 == 1 & X4 == 1 & X5 == 2 & X6 == 2,"X3",
                 ifelse(X4 == 1 & X5 == 1 & X6 == 2 & X7 == 2,"X4","X5")))))

Solution

  • You can use regexpr to capture the first occurrence of consecutive 1s followed by values greater than 1, for example

    set.seed(0)
    my_df %>%
        mutate(event = regexpr("1{2}[^1]{2}", do.call(paste0, select(., -ID)))) %>%
        mutate(event = ifelse(event > 0, paste0("X", event), NA))
    

    gives

      ID X1 X2 X3 X4 X5 X6 X7 X8 event
    1  a  2  3  2  1  2  2  4  3  <NA>
    2  b  1  2  2  1  1  2  1  4  <NA>
    3  c  4  2  2  1  4  2  3  2  <NA>
    4  d  3  3  2  2  3  3  2  2  <NA>
    5  e  1  3  3  1  1  4  1  3  <NA>
    6  f  2  1  1  1  4  4  4  3    X3
    7  g  1  1  3  2  3  4  4  2    X1
    8  h  3  1  1  2  2  2  1  2    X2