Search code examples
rconditional-statementsdplyrlevels

How to perform conditional mutate on unspecified columns (e.g. matching regex)?


I have a nested data.frame - df_nested, there one of column contains df:

df <- tibble(ID_Value = 1:8,
             xyz001 = c("text4", NA, NA, NA, NA, NA, NA, "text2"), 
             xyz002 = c(NA, NA, NA, "text3", "text1", NA, NA, NA),
             xyz003 = c(NA, "text1", NA, NA, "text2", NA, "text2", NA)) 

I want to find a way, how to mutate this df, on these requirements:

  1. mutate(across(matches("\\d")
  2. there are 4 cases - 4 levels by priority. text4 <- text3 <- text2 <- text1: I need to find and keep column values containing only the highest level text. e.g. if column contains text4, I want to remove text3, text2, text1 and replace them to NA. If it contains multiple highest order text, we should keep all these values (e.g. column xyz003).
  3. how to apply these conditions not specifying column name, because there can be any number in column name.
  4. if column contains all NAs, do nothing.

my attempt:

df_nested <- df_nested %>%
    mutate(df = map(data, ~.x %>%
       mutate(across(matches("\\dd"), function (x) {
                      conditions (ifelse, case_when or other)
                      ...}

Also, should we better use across(), or is vars() still a good way to do it as well? Thank you in advance.

Expected Output

df <- tibble(ID_Value = 1:8,
             xyz001 = c("text4", NA, NA, NA, NA, NA, NA, NA), 
             xyz002 = c(NA, NA, NA, "text3", NA, NA, NA, NA),
             xyz003 = c(NA, NA, NA, NA, "text2", NA, "text2", NA))

Solution

    1. Use a factor type to specify the order you want.
    2. Do row- or column-wise operations across the matches.

    Consider this function

    max_only <- function(x, lvls) {
      fct <- droplevels(factor(x, lvls))
      `[<-`(x, as.integer(fct) != length(levels(fct)), NA_character_)
    }
    

    You can then specify any order you want

    > max_only(c("apple", "banana", NA_character_), c("banana", "apple"))
    [1] "apple" NA      NA     
    > max_only(c("apple", "banana", NA_character_), c("apple", "banana"))
    [1] NA       "banana" NA   
    

    Case 1: column-wise operations

    df %>% 
      mutate(across(matches("\\d"), max_only, c("tier1", "tier2", "tier3", "tier4")))
    

    Output (this one looks more like your expected output)

    # A tibble: 8 x 4
      ID_Value xyz001 xyz002 xyz003
         <int> <chr>  <chr>  <chr> 
    1        1 tier4  NA     NA    
    2        2 NA     NA     NA    
    3        3 NA     NA     NA    
    4        4 NA     tier3  NA    
    5        5 NA     NA     tier2 
    6        6 NA     NA     NA    
    7        7 NA     NA     tier2 
    8        8 NA     NA     NA    
    

    Case 2: row-wise operations

    df %>% 
      mutate(as.data.frame(t(apply(
        across(matches("\\d")), 1L, 
        max_only, c("tier1", "tier2", "tier3", "tier4")
      ))))
    

    Output

    # A tibble: 8 x 4
      ID_Value xyz001 xyz002 xyz003
         <int> <chr>  <chr>  <chr> 
    1        1 tier4  NA     NA    
    2        2 NA     NA     tier1 
    3        3 NA     NA     NA    
    4        4 NA     tier3  NA    
    5        5 NA     NA     tier2 
    6        6 NA     NA     NA    
    7        7 NA     NA     tier2 
    8        8 tier2  NA     NA    
    

    Explanations

    1. [<- is almost equivalent to x[...] <- y; x. If ... is a logical vector (i.e. TRUE/FALSE), then values in x indexed by TRUE will be replaced by y. For example,

       > x <- c("a", "b" ,"c")
       > `[<-`(x, c(FALSE, TRUE, TRUE), NA_character_)
       [1] "a" NA  NA 
       > x[c(FALSE, TRUE, TRUE)] <- NA_character_; x
       [1] "a" NA  NA 
      
    2. NA_character_ is the NA value of a character type.

    3. as.integer(fct) != length(levels(fct)) returns a logical vector of the same length as fct. TRUE indexes positions where the values of fct are not the highest level, FALSE indexes the opposite, and NA indexes NAs. For example, assume that fct looks like this

       > x <- c("apple", "banana", NA)
       > fct <- droplevels(factor(x, c("apple", "banana", "pear")))
       > fct
       [1] apple  banana <NA>  
       Levels: apple banana
      

      Then, you can see that

       > as.integer(fct) != length(levels(fct))
       [1]  TRUE FALSE    NA 
      
    4. All together, it just means that we assign NA_character_s to values that are not equal to the highest level but keep NAs unchanged.

      [<-(x, as.integer(fct) != length(levels(fct)), NA_character_)