Search code examples
rstring-matching

ifelse statement to assign values to a new column, working with lists of numeric values


I have a dataframe that looks something like this:

# Minimal example dataframe

identifier <- c(
  "A",
  "B",
  "C",
  "D",
  "E",
  "F"
)

value_1 <- c(
  "1231811, 1231877",
  "1231911, 1233069, 1232767",
  "1231919",
  NA,
  "1232135, 1233145",
  NA
)

value_2 <- c(
  1231811,
  190477,
  922661,
  950711,
  992647,
  NA
  
)

value_3 <- c(
  1231877,
  1233069,
  9774041,
  9774041,
  1314063,
  1231379
  
)

test_df <- data.frame(identifier, value_1, value_2, value_3)

  identifier                   value_1 value_2 value_3
1          A          1231811, 1231877 1231811 1231877
2          B 1231911, 1233069, 1232767  190477 1233069
3          C                   1231919  922661 9774041
4          D                      <NA>  950711 9774041
5          E          1232135, 1233145  992647 1314063
6          F                      <NA>    <NA> 1231379

I want to create a new column, "final_value," and fill it in with a single value from value_1, value_2, or value_3 following a hierarchy that prioritizes value_1 values that match values in value_2 followed by value_3. If value_1 is not NA and does not have values that match anything in value_2 or value_3, I want to fill final_value with the first value in the comma-separated value_1 string. If value_1 is NULL, fill final_value with value_2 or, if that is also null, fill in with value_3. The final dataframe would look like this:

  identifier                   value_1 value_2 value_3 final_value
1          A          1231811, 1231877 1231811 1231877 1231811 # 1231811 from value_1 matches value_2 (preferred match)
2          B 1231911, 1233069, 1232767  190477 1233069 1233069 # no values from value_1 match value_2; however, 1233069 from value_1 matches value_3
3          C                   1231919  922661 9774041 1231919 # no values from value_1 match other columns; just fill with value_1
4          D                      <NA>  950711 9774041 950711  # value_1 is NA, so fill in with value_2
5          E          1232135, 1233145  992647 1314063 1232135 # no values from value_1 match other columns, fill with first item from value_1 list
6          F                      <NA>    <NA> 1231379 1231379 # value_1 and value_2 are NA, so fill in with value_3

Here's my approach so far...

library(purrr)
library(dplyr)

# change value_1 column into a list of numeric values 
test_df <- test_df%>% mutate(value_1 = map(value_1,function(x) (as.numeric(unlist(str_split(x,","))))))

# create a new column to hold the final selected value
test_df$final_value <- NA

# ifelse statement
test_df$final_value <- 
  
  # if any of the elements in value_1 match the value_2 value, fill the new column with value_2
  ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_2, test_df$value_2,
         
         # otherwise, if a value in value_1 matches value_3, fill in with value_3
         ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_3, test_df$value_3,
                
                # if none of the values in value_1 match the other columns, fill in with the first value_1 list value
                ifelse(!is.na(test_df$value_1) & !(test_df$value_1 %in% test_df$value_2) & !(test_df$value_1 %in% test_df$value_3), test_df$value_1, #NOTE: have tried test_df$value_1[1] and test_df$value_1[[1]] without success to get the first list item returned
                       
                       # if value_1 is NA, fill in with value_2
                       ifelse(is.na(test_df$value_1) & !is.na(test_df$value_2), test_df$value_2,
                              
                              # if value_1 is NA and value_2 is NA, fill in with value_3
                              ifelse(is.na(test_df$value_1) & is.na(test_df$value_2) & !is.na(test_df$value_3), test_df$value_3, NA
         
         
  )))))

There are a few problems with the result:

  identifier                   value_1 value_2 value_3               final_value
1          A          1231811, 1231877 1231811 1314063          1231811, 1231877
2          B 1231911, 1233069, 1232767  190477 1233069 1231911, 1233069, 1232767
3          C                   1231919  922661 9774041                   1231919
4          D                        NA  950711 9774041                    950711
5          E          1232135, 1233145  992647 1314063          1232135, 1233145
6          F                        NA      NA 1231379                   1231379

The first three lines of the ifelse are not working as anticipated. It is failing to return the matching value_2 or value_3 value in final_value and I also cannot get it to return the first list item from value_1 where there aren't any matching value_2 or value_3 values. For the latter, I've tried specifying test_df$value_1[[1]][1] (and similar) but this only returns the first item in the identifer A value_1 list:

  identifier                   value_1 value_2 value_3 final_value
1          A          1231811, 1231877 1231811 1314063     1231811
2          B 1231911, 1233069, 1232767  190477 1233069     1231811
3          C                   1231919  922661 9774041     1231811
4          D                        NA  950711 9774041      950711
5          E          1232135, 1233145  992647 1314063     1231811
6          F                        NA      NA 1231379     1231379

Any help would be greatly appreciated.


Solution

  • First, nesting ifelse beyond 2-deep generally leads me to suggest case_when. However, in this case I think there is a much better solution without that:

    func func <- function(A, ...) {
      if (length(A) == 1L && is.na(A)) {
        if (length(list(...))) na.omit(unlist(list(...)))[1] else NA
      } else {
        L <- lapply(list(...), intersect, x = A)
        L <- c(L[lengths(L) > 0], A)
        L[[1]][1]
      }
    }
    
    library(dplyr)
    test_df %>%
      mutate(
        final_value = mapply(func, strsplit(value_1, "[, ]+"), value_2, value_3)
      )
    #   identifier                   value_1 value_2 value_3 final_value
    # 1          A          1231811, 1231877 1231811 1231877     1231811
    # 2          B 1231911, 1233069, 1232767  190477 1233069     1233069
    # 3          C                   1231919  922661 9774041     1231919
    # 4          D                      <NA>  950711 9774041      950711
    # 5          E          1232135, 1233145  992647 1314063     1232135
    # 6          F                      <NA>      NA 1231379     1231379
    

    Because I use ... in func, this handles "0 or more" other value_* variables as you want; if you have 3 or 30 more, it will apply the same logic. Further, the order within ... matters, those listed earlier will be prioritized higher for matches.

    The c(L[lengths(L) > 0], A) ensures (1) we only consider value_* that have non-empty intersections (first portion), and if all of those are empty, we use what is found in A. (In the unlikely event that A is NA and all value_* are empty, then ... you get NA.)

    FYI, one inner-step of this is to split your strings of comma-separated numbers into a list-column using strsplit. If you're going to do more and similar operations that need to work on individual components within, you may prefer to keep it as such using mutate(value_1 = strsplit(value_1, "[ ,]+")) (or similar).