I have a df, data:
data = data.frame("text" = c("John met Jay who met Jack who met Josh who met Jamie", "John and Jay and Jack and Josh and Jamie"),
"names.in.text" = c("Jay; Jack; Josh; Jamie", "John; Jack; Josh; Jamie"),
"missing.names" = c("",""))
> data
                                                  text           names.in.text missing.names
1 John met Jay who met Jack who met Josh who met Jamie  Jay; Jack; Josh; Jamie
2             John and Jay and Jack and Josh and Jamie John; Jack; Josh; Jamie
and a second df of names:
names = data.frame("names" = c("John", "Jay", "Jack", "Josh", "Jamie"))
> names
names
1 John
2 Jay
3 Jack
4 Josh
5 Jamie
I am trying to find out whether data$names.in.text contains all the names that appear in data$text. The universe of names is in names$names. Ideally, for each row of data, I'd like data$missing.names to list which of names$names appear in data$text but are missing from data$names.in.text:
                                                  text           names.in.text missing.names
1 John met Jay who met Jack who met Josh who met Jamie  Jay; Jack; Josh; Jamie          John
2             John and Jay and Jack and Josh and Jamie John; Jack; Josh; Jamie           Jay
Or any other configuration that would easily tell me which names are in the text but missing from names.in.text.
So essentially I am looking to find which names$names are included in data$text but not in data$names.in.text, and then list those names in data$missing.names.
A tidyverse solution:
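library(tidyverse)
# For each row: split names.in.text on '; ', take the known names that are not
# listed there, and extract whichever of those actually occur in text.
data %>%
  mutate(missing.names = map2_chr(text, str_split(names.in.text, '; '),
    ~ str_c(str_extract_all(.x, regex(str_c(setdiff(names$names, .y), collapse = '|')))[[1]],
            collapse = '; ')))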
# # A tibble: 2 × 3
# text names.in.text missing.names
# <chr> <chr> <chr>
# 1 John met Jay who met Jack who met Josh who met Jamie Jay; Jack; Josh; Jamie John
# 2 John and Jay and Jack and Josh and Jamie John; Jack; Josh; Jamie Jay
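
If you would rather avoid the tidyverse dependency, the same idea can be sketched in base R. This is only a sketch under a few assumptions: `find_missing` is a hypothetical helper name (not from any package), the names contain no regex metacharacters, and a name only counts as present when it appears as a whole word (so "Jay" does not match inside "Jayden"). It also returns an empty string when nothing is missing, rather than building an empty pattern.

```r
# Example data from the question
data <- data.frame(
  text = c("John met Jay who met Jack who met Josh who met Jamie",
           "John and Jay and Jack and Josh and Jamie"),
  names.in.text = c("Jay; Jack; Josh; Jamie", "John; Jack; Josh; Jamie"),
  stringsAsFactors = FALSE
)
names <- data.frame(names = c("John", "Jay", "Jack", "Josh", "Jamie"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: names from `universe` that occur in `text`
# but are not listed in the "; "-separated string `listed`.
find_missing <- function(text, listed, universe) {
  listed_vec <- trimws(strsplit(listed, ";", fixed = TRUE)[[1]])
  # Keep only names that occur in the text as whole words
  present <- vapply(
    universe,
    function(n) grepl(paste0("\\b", n, "\\b"), text, perl = TRUE),
    logical(1)
  )
  in_text <- universe[present]
  # Names found in the text but absent from the listed names
  paste(setdiff(in_text, listed_vec), collapse = "; ")
}

data$missing.names <- mapply(
  find_missing, data$text, data$names.in.text,
  MoreArgs = list(universe = names$names),
  USE.NAMES = FALSE
)

data$missing.names
```

With the example data this should give "John" for the first row and "Jay" for the second, matching the desired output in the question.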