I want to know, whether a string in one column exists in a range of columns. The search column can contain several strings, separated by ", ". I call them "search terms". I don't care if one or several terms are found, but I need to know if there are duplicates. Here is some mock-data
df <- data.frame(
a=c("a","b","c, d","d"),
b=c(NA, "k", NA,"k"),
c=c("c1","c2","c3","c4, c5"),
search_terms=c("a",NA,"c, a","a, c5"))
df
a b c search_terms
1 a <NA> c1 a
2 b k c2 <NA>
3 c, d <NA> c3 c, a
4 d k c4, c5 a, c5
What I would like as a result is:
test
1 search term found in a_c
2 <NA>
3 search term found in a_c
4 search term found in a_c
explainer:
So its possible to search for a string in all substrings of the search column. The following code correctly identifies "c5" in row 4. But I fail to do it a rowwise match.
df %>% mutate(test=ifelse(sapply(strsplit(df$search_terms, ", "),
function(x) {any(x == "c5")}),
"search term found in a_c",NA)) %>%
select(test)
test
1 <NA>
2 <NA>
3 <NA>
4 search term found in a_c
I further manage to check the existence rowwise, but not when the input is a list of strings. This code correctly identifies the first match, but neither the third nor the fourth.
df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>%
mutate(test=ifelse(mapply(grepl, search_terms,a_c),
"search term found in a_c",NA))%>%
select(test)
test
1 search term found in a_c
2 <NA>
3 <NA>
4 <NA>
I hoped to combine the two along the following lines, but failed with grepl
only taking the first element. So it correctly identifies the first and the third match, but fails to identify the match in row 4. So why does the any
-command not work here, but in the first line of code?
df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>%
mutate(test=ifelse(apply(.,1,function(x) {
sapply(strsplit(x["search_terms"],", "), function(y) {
any(grepl(y,x["a_c"]))
})
}),"search_term in a_c",NA)
) %>%
select(test)
test
1 search term found in a_c
2 <NA>
3 search term found in a_c
4 <NA>
Warning messages:
1: Problem while computing `test = ifelse(...)`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
2: Problem while computing `test = ifelse(...)`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
There was an answer which now seems deleted. It didn't work as I wanted but gave some crucial insights for the following solution:
library(dplyr)
df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>%
mutate(test=ifelse(apply(.,1,function(x) {
sapply(strsplit(x["search_terms"],", "), function(y) {
any(sapply(y, function(z) grepl(z,x["a_c"])))
})
}),"search_term in a_c",NA)
) %>%
select(test)
The thing is that strsplit returns a list and hence requires a sapply
command within the any
function. As I understand the functioning of those nested *apply
commands:
apply(.,…
makes sure the following function(x)
is applied on every row of df
sapply(strsplit(
applies strsplit
and outputs the list of search terms. For every list of search terms, the function(y)
is appliedany(sapply(
applies the grepl
function on every it of the list of search terms.However, I don't understand the logic fully and have the impression there is an easier way to solve that, with less *apply
-functions. I further could imagine that there could be a cleaner tidyr
-approach. However, the function gives me the desired output (here, with a slightly more complex df
).
df <- data.frame(
a=c("a","b","c, d","d", "e"),
b=c(NA, "x", NA,"y", NA),
c=c("c1","c2","c3","c4, c5", "c5, c6"),
search_terms=c("a",NA,"c, a","x", "l, c6"),stringsAsFactors = F)
library(dplyr)
df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>%
mutate(test=ifelse(apply(.,1,function(x) {
sapply(strsplit(x["search_terms"],", "), function(y) {
any(sapply(y, function(z) grepl(z,x["a_c"])))
})
}),"search_term in a_c",NA)
) %>%
select(-a_c)
a b c search_terms test
1 a <NA> c1 a search_term in a_c
2 b x c2 <NA> <NA>
3 c, d <NA> c3 c, a search_term in a_c
4 d y c4, c5 x <NA>
5 e <NA> c5, c6 l, c6 search_term in a_c