Search code examples
rtidyverse

Count occurrences in specific column ranges and return factor variable, R


I have data like this:

df<-structure(list(levels_incised___1 = c(0, 0, 0, 0, 0, 0), levels_incised___2 = c(1, 
0, 0, 0, 0, 0), levels_incised___3 = c(1, 0, 0, 0, 0, 0), levels_incised___4 = c(1, 
0, 0, 0, 0, 0), levels_incised___5 = c(1, 0, 0, 0, 0, 0), levels_incised___6 = c(1, 
0, 0, 0, 0, 0), levels_incised___7 = c(1, 0, 0, 0, 0, 0), levels_incised___8 = c(1, 
1, 1, 0, 0, 0), levels_incised___9 = c(1, 1, 1, 0, 0, 0), levels_incised___10 = c(1, 
1, 1, 0, 0, 0), levels_incised___11 = c(0, 1, 0, 0, 0, 0), levels_incised___12 = c(0, 
1, 0, 0, 0, 0), levels_incised___13 = c(0, 1, 0, 0, 0, 0), levels_incised___14 = c(0, 
1, 0, 0, 0, 0), levels_incised___15 = c(0, 1, 0, 0, 0, 0), levels_incised___16 = c(0, 
0, 0, 0, 0, 0), levels_incised___17 = c(0, 0, 0, 0, 0, 0), levels_incised___18 = c(0, 
0, 0, 0, 0, 0), levels_incised___19 = c(0, 0, 0, 0, 0, 0), levels_incised___20 = c(0, 
0, 0, 0, 0, 0), levels_incised___21 = c(0, 0, 0, 0, 0, 0), levels_incised___22 = c(0, 
0, 0, 0, 1, 0), levels_incised___23 = c(0, 0, 0, 0, 1, 1), levels_incised___24 = c(0, 
0, 0, 0, 1, 1), levels_incised___25 = c(0, 0, 0, 0, 1, 1), levels_incised___26 = c(0, 
0, 0, 0, 1, 1), levels_incised___27 = c(0, 0, 0, 1, 1, 1), levels_incised___28 = c(0, 
0, 0, 1, 1, 1), levels_incised___29 = c(0, 0, 0, 1, 1, 0), levels_incised___30 = c(0, 
0, 0, 1, 1, 0), levels_incised___31 = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

That originally came from this Redcap input where each button was one of those columns: enter image description here

And I need to create a column at the end (lets call it Level) with these possible inputs:

  • Cervical (any of the c buttons)
  • Thoracic (the t's)
  • Lumbar (the L's)
  • Sacral (sacral)
  • Thoracocervical (t's or c's)
  • Thoracolumbar (t's or l's)
  • Lumbosacral (l's and sacral)

So for instance, the patient in the first row had "1"'s in levels_incised_2 through levels_incised_10... meaning they had values in both the cervical range and the thoracic range. So that patient should get "Thoracocervical".

The patient in row 2 had 1's in 8 through 15, so they'd only get a "thoracic"

Does anyone know the most straight forward way to accomplish this?

Oh one last detail, there's 100+ other columns so it'd be nice if I could select/name these specific ones to count


Solution

  • A few things to resolve here:

    • find a way to convert levels...# to one of the C/T/... categories;
    • produce logic to infer based on presence of groups.

    I think the first can be done by extracting the number and using findInterval to determine with of C/T/... each column belongs to. From there, we can do some simple c_across to find "any" in a group, and case_when to get your Level labels.

    library(dplyr)
    
    # helper function for renaming
    func <- function(z) {
      num <- as.integer(gsub("\\D", "", z))
      grp <- c("C","T","L","S","Co","unclear")[findInterval(num, 1+c(0, 7, 19, 24, 29, 30, 31))]
      grp <- paste0(grp, ave(grp, grp, FUN = seq_along))
      # fix those that do not need numbering
      grp[grepl("^Co", grp)] <- "Co"
      grp[grepl("^unc", grp)] <- "unclear"
      grp
    }
    
    out <- df %>%
      rename_with(.cols = starts_with("levels"), .fn = func) %>%
      rowwise() %>%
      mutate(
        anyC = sum(c_across(C1:C7)) > 0, 
        anyT = sum(c_across(T1:T12)) > 0, 
        anyL = sum(c_across(L1:L5)) > 0, 
        anyS = sum(c_across(S1:S5)) > 0
      ) %>%
      ungroup() %>%
      mutate(
        Level = case_when(
          anyC & anyT & anyL ~ "More than 2?", 
          anyL & anyS ~ "Lumbosacral", 
          anyT & anyL ~ "Thoracolumbar", 
          anyT & anyC ~ "Thoracocervical", 
          anyS ~ "Sacral", 
          anyL ~ "Lumbar", 
          anyT ~ "Thoracic", 
          anyC ~ "Cervical", 
          TRUE ~ "Nothing?"
        )
      )
    out
    # # A tibble: 6 x 36
    #      C1    C2    C3    C4    C5    C6    C7    T1    T2    T3    T4    T5    T6    T7    T8    T9   T10   T11   T12    L1    L2    L3    L4    L5    S1    S2    S3    S4    S5    Co unclear anyC  anyT  anyL  anyS  Level          
    #   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <lgl> <lgl> <lgl> <lgl> <chr>          
    # 1     0     1     1     1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0       0 TRUE  TRUE  FALSE FALSE Thoracocervical
    # 2     0     0     0     0     0     0     0     1     1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0       0 FALSE TRUE  FALSE FALSE Thoracic       
    # 3     0     0     0     0     0     0     0     1     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0       0 FALSE TRUE  FALSE FALSE Thoracic       
    # 4     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     1     1     1       0 FALSE FALSE FALSE TRUE  Sacral         
    # 5     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     1     1     1     1     1     1     1     1       0 FALSE FALSE TRUE  TRUE  Lumbosacral    
    # 6     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     1     1     1     1     1     0     0       0 FALSE FALSE TRUE  TRUE  Lumbosacral    
    out$Level
    # [1] "Thoracocervical" "Thoracic"        "Thoracic"        "Sacral"          "Lumbosacral"     "Lumbosacral"    
    

    If you don't want to keep the renaming, then you can combine the Level result to your original frame with cbind(df, Level = out$Level).