Dummy variable "switch-point" in R


I have a dummy variable that serves as a flag for a number of conditions in my data set. I can't figure out how to write a function that marks the spot at which the flag makes its "final switch", i.e. takes a value that will not change for the rest of the data frame. In the example below, every observation from the 7th onward is a "y".

  dplyr::tibble(
    observation = c(seq(1,10)),
    crop = c(runif(3,1,25),
              runif(1,50,100),
              runif(2,1,10),
              runif(4,50,100)),
    flag = c(rep("n", 3),
             rep("y", 1),
             rep("n", 2),
             rep("y", 4)))

Which yields:

   observation  crop flag 
         <int> <dbl> <chr>
 1           1 13.3  n    
 2           2  4.34 n    
 3           3 17.1  n    
 4           4 80.5  y    
 5           5  9.62 n    
 6           6  8.39 n    
 7           7 92.6  y    
 8           8 74.1  y    
 9           9 95.3  y    
10          10 69.9  y    

I've tried creating a second flag that marks every switch and returns the "final" switch/flag variable, but over my whole data frame that will likely be highly inefficient. Any suggestions are welcome and appreciated.


Solution

  • One way to do this is to create a second flag that cumulatively counts the flag switches, so that each run of identical values gets its own number.

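    library(dplyr)  # provides %>%, mutate() and lag()

    # Cumulative sum that treats NA as 0 (lag() returns NA for the first row)
    cumsum_na <- function(x){
      x[is.na(x)] <- 0
      return(cumsum(x))
    }
    
    df <- dplyr::tibble(
        observation = c(seq(1,10)),
        crop = c(runif(3,1,25),
                  runif(1,50,100),
                  runif(2,1,10),
                  runif(4,50,100)),
        flag = c(rep("n", 3),
                 rep("y", 1),
                 rep("n", 2),
                 rep("y", 4)))
    
    # flag2 goes up by 1 every time flag differs from the previous row
    df %>%
      mutate(flag2 = ifelse(flag != lag(flag), 1, 0) %>%
                   cumsum_na)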
    
    # A tibble: 10 x 4
       observation  crop flag  flag2
             <int> <dbl> <chr> <dbl>
     1           1 12.1  n         0
     2           2 11.2  n         0
     3           3  4.66 n         0
     4           4 61.6  y         1
     5           5  6.00 n         2
     6           6  9.54 n         2
     7           7 67.6  y         3
     8           8 86.7  y         3
     9           9 91.6  y         3
    10          10 84.5  y         3
    

    You can then do whatever you need with the flag2 column. For example, filter for its maximum value and take the first row, which gives the first observation of the final, constant state.
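
The filtering step described above could be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: the names `runs` and `switch_point` are made up for the example, the `crop` column is left out because only `flag` matters here, and `lag(flag, default = first(flag))` is swapped in for the `cumsum_na` helper as another way to avoid the `NA` on the first row.

```r
library(dplyr)

# Deterministic stand-in for the example data (crop omitted; only flag matters)
df <- tibble(
  observation = 1:10,
  flag = c(rep("n", 3), "y", rep("n", 2), rep("y", 4))
)

# Number each run of identical flag values.
# default = first(flag) makes row 1 compare equal to itself, so no NA appears.
runs <- df %>%
  mutate(flag2 = cumsum(flag != lag(flag, default = first(flag))))

# Keep only the last run and take its first row: the "final switch" point
switch_point <- runs %>%
  filter(flag2 == max(flag2)) %>%
  slice(1)

switch_point$observation
# 7
```

If the flag never changes at all, every row belongs to run 0, so this returns the first row of the data frame; whether that counts as a "switch" depends on what you need downstream.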