Search code examples
rdataframedata-cleaningsequential

Looking for break patterns in row-wise sequences of values and NA's


I'm working with a dataset where each row represents where an individual used services. This implicitly also tracks whether someone uses services, because if they did not, the column value for the month is NA. I want to identify cases where there are breaks (periods of absence followed by return) in an individual's service based on transitions between consecutive month columns.

In other words, I am specifically interested in identifying when someone goes from using services (there are values in a month's column) to not using services (the value of 1+ subsequent month is NA) and then back to using services (values in 1+ subsequent month columns, preceded by NAs). My hope is to have a binary TRUE/FALSE column ("Breaks_in_Service"). When someone begins services does not matter (meaning, NAs only matter after the first month-column with a value in it).

Here's a simplified version of my dataset:

# Sample Data
simp_2021 <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  jan21_ORG_NAME = c("Org A", NA, NA, "Org B", "Org B"),
  feb21_ORG_NAME = c(NA, "Org A", "Org B", NA, "Org B"),
  mar21_ORG_NAME = c(NA, NA, "Org B", "Org D", NA),
  apr21_ORG_NAME = c("Org B", NA, "Org C", NA, "Org E")
)

# Initialize Breaks_in_Service column as FALSE
simp_2021$Breaks_in_Service <- FALSE

# View
print(simp_2021)

Expected output: In the sample data, Breaks_in_Service should be TRUE for IDs 1, 4, and 5, and FALSE for IDs 2 and 3.

I've tried building a for loop, but it's gotten messy and isn't working:

# Loop over each row to check for breaks in service
for (i in 1:nrow(simp_2021)) {
  row_values <- simp_2021[i, 2:ncol(simp_2021)]  # Extract service columns for the current row
  
  # Initialize flags to track service usage
  in_service <- FALSE
  found_break <- FALSE
  
  # Check transitions within the row
  for (j in 1:(length(row_values) - 1)) {
    current_value <- row_values[[j]]
    next_value <- row_values[[j + 1]]
    
    if (is.na(current_value) && !is.na(next_value)) {
      # Transition from not using service to using service
      in_service <- TRUE
    } else if (!is.na(current_value) && is.na(next_value)) {
      # Transition from using service to not using service
      if (in_service) {
        found_break <- TRUE
        break  # Found a break, no need to check further
      }
    }
  }
  
  # Set Breaks_in_Service based on found breaks
  if (found_break) {
    simp_2021$Breaks_in_Service[i] <- TRUE
  }
}

# View the updated dataframe with the new 'Breaks_in_Service' column
print(simp_2021)

Solution

  • You could, first, equal the month columns to binary !NA using +!is.na() and paste to binary strings, next sub away all starting 0s, finally grepl for the pattern 01.

    > (tmp <- apply(+!is.na(simp_2021[-1]), 1, paste, collapse=''))
    [1] "1001" "0100" "0111" "1010" "1101"
    > (tmp <- sub(tmp, pat='^0+', rep=''))
    [1] "1001" "100"  "111"  "1010" "1101"
    > (tmp <- grepl(tmp, pat='01'))
    [1]  TRUE FALSE FALSE  TRUE  TRUE
    

    Altogether in a nice pipe:

    > transform(simp_2021, 
    +           Breaks_in_Service=apply(+!is.na(simp_2021[-1]), 1, paste, collapse='') |> 
    +             sub(pat='^0+', rep='') |> 
    +             grepl(pat='01'))
      ID jan21_ORG_NAME feb21_ORG_NAME mar21_ORG_NAME apr21_ORG_NAME Breaks_in_Service
    1  1          Org A           <NA>           <NA>          Org B              TRUE
    2  2           <NA>          Org A           <NA>           <NA>             FALSE
    3  3           <NA>          Org B          Org B          Org C             FALSE
    4  4          Org B           <NA>          Org D           <NA>              TRUE
    5  5          Org B          Org B           <NA>          Org E              TRUE
    

    Note: this approach also uses row-wise apply, but on a "matrix" for which it is designed for and is therefore efficient.

    > is.matrix(+!is.na(simp_2021[-1]))
    [1] TRUE
    

    Data:

    > dput(simp_2021)
    structure(list(ID = c(1, 2, 3, 4, 5), jan21_ORG_NAME = c("Org A", 
    NA, NA, "Org B", "Org B"), feb21_ORG_NAME = c(NA, "Org A", "Org B", 
    NA, "Org B"), mar21_ORG_NAME = c(NA, NA, "Org B", "Org D", NA
    ), apr21_ORG_NAME = c("Org B", NA, "Org C", NA, "Org E")), class = "data.frame", row.names = c(NA, 
    -5L))