Search code examples
rdatabasedataframeredundancy

R: Determine redundancies and unique values in a set of columns


I'm hoping to determine when the values in a set of columns are redundant, documenting it in a new column multi? where 0 means only a single value is seen, 1 means multiple values are seen. When the value "Unspecified" is with other values, I'd like the code to ignore it and assess the redundancy of other values accordingly. When the value "Unspecified" is the only value in the set of columns, I would like the column multi? to document "Unspecified".

Of note, these four columns are just a segment of a much larger database with many more columns.

To illustrate what I mean, I have provided an example input and output below:

  headbleed_type_dx1 headbleed_type_dx2 headbleed_type_dx3 headbleed_type_dx4
1      Intracerebral      Intracerebral      Intracerebral               <NA>      
2      Intracerebral      Subarachnoid                <NA>           Subdural      
3        Unspecified      Intracerebral           Subdural      Intracerebral      
4        Unspecified               <NA>                <NA>               <NA>               
5               <NA>               <NA>                <NA>               <NA>               

If the row was 1 for Multi?, then I'd also like to document the number of unique values in the new column Number

  Multi?       Number
1 0            1
2 1            3
3 1            2
4 Unspecified  1
5 NA           NA 

Solution

  • That's really cumbersome and I would really advise against mixing numbers and characters in a column. Having said that and if you are open to a dplyr based solution

    library(dplyr)
    
    data %>% 
      rowwise() %>% 
      summarise(
        number = n_distinct(
          c_across(headbleed_type_dx1:headbleed_type_dx4), 
          na.rm = TRUE),
        unspec = coalesce(
          any(c_across(headbleed_type_dx1:headbleed_type_dx4) == "Unspecified"), 
          FALSE)) %>% 
      mutate(
        number2 = if_else(number > 1L & unspec, number - 1L, na_if(number, 0)),
        multi = case_when(number == 1 & unspec ~ "Unspecific",
                          number2 == 1 ~ "0",
                          is.na(number2) ~ NA_character_,
                          TRUE ~ "1"),
        .keep = "none") %>% 
      select(number = number2, multi)
    

    This returns

    # A tibble: 6 × 2
      number multi     
       <int> <chr>     
    1      1 0         
    2      3 1         
    3      2 1         
    4      1 Unspecific
    5     NA NA        
    6      1 0       
    

    Data

    structure(list(headbleed_type_dx1 = c("Intracerebral", "Intracerebral", 
    "Unspecified", "Unspecified", NA, "Intracerebral"), headbleed_type_dx2 = c("Intracerebral", 
    "Subarachnoid", "Intracerebral", NA, NA, "Unspecified"), headbleed_type_dx3 = c("Intracerebral", 
    NA, "Subdural", NA, NA, "Intracerebral"), headbleed_type_dx4 = c(NA, 
    "Subdural", "Intracerebral", NA, NA, NA)), problems = structure(list(
        row = 1:4, col = c(NA_character_, NA_character_, NA_character_, 
        NA_character_), expected = c("4 columns", "4 columns", "4 columns", 
        "4 columns"), actual = c("5 columns", "5 columns", "5 columns", 
        "5 columns"), file = c("literal data", "literal data", "literal data", 
        "literal data")), row.names = c(NA, -4L), class = c("tbl_df", 
    "tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl", 
    "data.frame"), row.names = c(NA, -6L), spec = structure(list(
        cols = list(headbleed_type_dx1 = structure(list(), class = c("collector_character", 
        "collector")), headbleed_type_dx2 = structure(list(), class = c("collector_character", 
        "collector")), headbleed_type_dx3 = structure(list(), class = c("collector_character", 
        "collector")), headbleed_type_dx4 = structure(list(), class = c("collector_character", 
        "collector"))), default = structure(list(), class = c("collector_guess", 
        "collector")), skip = 1L), class = "col_spec"))