I have a dataframe that looks something like this:
# Minimal example dataframe
identifier <- c(
"A",
"B",
"C",
"D",
"E",
"F"
)
value_1 <- c(
"1231811, 1231877",
"1231911, 1233069, 1232767",
"1231919",
NA,
"1232135, 1233145",
NA
)
value_2 <- c(
1231811,
190477,
922661,
950711,
992647,
NA
)
value_3 <- c(
1231877,
1233069,
9774041,
9774041,
1314063,
1231379
)
test_df <- data.frame(identifier, value_1, value_2, value_3)
identifier value_1 value_2 value_3
1 A 1231811, 1231877 1231811 1231877
2 B 1231911, 1233069, 1232767 190477 1233069
3 C 1231919 922661 9774041
4 D <NA> 950711 9774041
5 E 1232135, 1233145 992647 1314063
6 F <NA> <NA> 1231379
I want to create a new column, "final_value," and fill it in with a single value from value_1, value_2, or value_3 following a hierarchy that prioritizes value_1 values that match values in value_2 followed by value_3. If value_1 is not NA and does not have values that match anything in value_2 or value_3, I want to fill final_value with the first value in the comma-separated value_1 string. If value_1 is NULL, fill final_value with value_2 or, if that is also null, fill in with value_3. The final dataframe would look like this:
identifier value_1 value_2 value_3 final_value
1 A 1231811, 1231877 1231811 1231877 1231811 # 1231811 from value_1 matches value_2 (preferred match)
2 B 1231911, 1233069, 1232767 190477 1233069 1233069 # no values from value_1 match value_2; however, 1233069 from value_1 matches value_3
3 C 1231919 922661 9774041 1231919 # no values from value_1 match other columns; just fill with value_1
4 D <NA> 950711 9774041 950711 # value_1 is NA, so fill in with value_2
5 E 1232135, 1233145 992647 1314063 1232135 # no values from value_1 match other columns, fill with first item from value_1 list
6 F <NA> <NA> 1231379 1231379 # value_1 and value_2 are NA, so fill in with value_3
Here's my approach so far...
library(purrr)
library(dplyr)
# change value_1 column into a list of numeric values
test_df <- test_df%>% mutate(value_1 = map(value_1,function(x) (as.numeric(unlist(str_split(x,","))))))
# create a new column to hold the final selected value
test_df$final_value <- NA
# ifelse statement
test_df$final_value <-
# if any of the elements in value_1 match the value_2 value, fill the new column with value_2
ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_2, test_df$value_2,
# otherwise, if a value in value_1 matches value_3, fill in with value_3
ifelse(!is.na(test_df$value_1) & test_df$value_1 %in% test_df$value_3, test_df$value_3,
# if none of the values in value_1 match the other columns, fill in with the first value_1 list value
ifelse(!is.na(test_df$value_1) & !(test_df$value_1 %in% test_df$value_2) & !(test_df$value_1 %in% test_df$value_3), test_df$value_1, #NOTE: have tried test_df$value_1[1] and test_df$value_1[[1]] without success to get the first list item returned
# if value_1 is NA, fill in with value_2
ifelse(is.na(test_df$value_1) & !is.na(test_df$value_2), test_df$value_2,
# if value_1 is NA and value_2 is NA, fill in with value_3
ifelse(is.na(test_df$value_1) & is.na(test_df$value_2) & !is.na(test_df$value_3), test_df$value_3, NA
)))))
There are a few problems with the result:
identifier value_1 value_2 value_3 final_value
1 A 1231811, 1231877 1231811 1314063 1231811, 1231877
2 B 1231911, 1233069, 1232767 190477 1233069 1231911, 1233069, 1232767
3 C 1231919 922661 9774041 1231919
4 D NA 950711 9774041 950711
5 E 1232135, 1233145 992647 1314063 1232135, 1233145
6 F NA NA 1231379 1231379
The first three lines of the ifelse are not working as anticipated. It is failing to return the matching value_2 or value_3 value in final_value and I also cannot get it to return the first list item from value_1 where there aren't any matching value_2 or value_3 values. For the latter, I've tried specifying test_df$value_1[[1]][1]
(and similar) but this only returns the first item in the identifer A value_1 list:
identifier value_1 value_2 value_3 final_value
1 A 1231811, 1231877 1231811 1314063 1231811
2 B 1231911, 1233069, 1232767 190477 1233069 1231811
3 C 1231919 922661 9774041 1231811
4 D NA 950711 9774041 950711
5 E 1232135, 1233145 992647 1314063 1231811
6 F NA NA 1231379 1231379
Any help would be greatly appreciated.
First, nesting ifelse
beyond 2-deep generally leads me to suggest case_when
. However, in this case I think there is a much better solution without that:
func func <- function(A, ...) {
if (length(A) == 1L && is.na(A)) {
if (length(list(...))) na.omit(unlist(list(...)))[1] else NA
} else {
L <- lapply(list(...), intersect, x = A)
L <- c(L[lengths(L) > 0], A)
L[[1]][1]
}
}
library(dplyr)
test_df %>%
mutate(
final_value = mapply(func, strsplit(value_1, "[, ]+"), value_2, value_3)
)
# identifier value_1 value_2 value_3 final_value
# 1 A 1231811, 1231877 1231811 1231877 1231811
# 2 B 1231911, 1233069, 1232767 190477 1233069 1233069
# 3 C 1231919 922661 9774041 1231919
# 4 D <NA> 950711 9774041 950711
# 5 E 1232135, 1233145 992647 1314063 1232135
# 6 F <NA> NA 1231379 1231379
Because I use ...
in func
, this handles "0 or more" other value_*
variables as you want; if you have 3 or 30 more, it will apply the same logic. Further, the order within ...
matters, those listed earlier will be prioritized higher for matches.
The c(L[lengths(L) > 0], A)
ensures (1) we only consider value_*
that have non-empty intersections (first portion), and if all of those are empty, we use what is found in A
. (In the unlikely event that A
is NA
and all value_*
are empty, then ... you get NA
.)
FYI, one inner-step of this is to split your strings of comma-separated numbers into a list-column using strsplit
. If you're going to do more and similar operations that need to work on individual components within, you may prefer to keep it as such using mutate(value_1 = strsplit(value_1, "[ ,]+"))
(or similar).