I have two data sets, Review Data & Topic Data
Dput code of my Review Data
structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved",
"Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Dput code of my Topic Data
structure(list(word = structure(2:1, .Label = c("canteen food",
"sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen",
"Sports "), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Dput of my desired output. I want to look up the words that appear in Topic Data and map the corresponding Topic onto the Review Data:
structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved",
"Sports and physical exercise need to be given importance"), class = "factor"),
Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
What you want is something like a fuzzy join. Here is a brute-force approach that looks for a strict (but case-insensitive) substring match:
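library(dplyr)
review %>%
  # full cartesian expansion: every review paired with every topic row
  # (in dplyr >= 1.1.0, by = character() is deprecated in favor of cross_join(topic))
  full_join(topic, by = character()) %>%
  group_by(word) %>%
  # within each group there is a single pattern, so word[1] is safe to use
  mutate(matched = grepl(word[1], Review, ignore.case = TRUE)) %>%
  ungroup() %>%
  filter(matched) %>%            # keep only the pairs that matched
  select(-word, -matched)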
# # A tibble: 2 x 2
# Review Topic
# <fct> <fct>
# 1 Sports and physical exercise need to be given importance "Sports "
# 2 Canteen Food could be improved "Canteen"
It's a little brute-force in that it does a cartesian join of the two frames before testing each pair with grepl, but some amount of pairwise comparison is hard to avoid.
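If you would rather not depend on dplyr, the same pairwise test can be sketched in base R. This is only an illustration under some assumptions: the frames are rebuilt here with character columns (row order taken from the dput in the question), and the names match_matrix, hits and result are made up for this example. It still compares every review with every word, but it holds a logical matrix instead of an expanded data frame.

```r
# Recreate the two frames from the question, using character columns.
review <- data.frame(
  Review = c("Sports and physical exercise need to be given importance",
             "Canteen Food could be improved"),
  stringsAsFactors = FALSE
)
topic <- data.frame(
  word  = c("sports and physical", "canteen food"),
  Topic = c("Sports ", "Canteen"),
  stringsAsFactors = FALSE
)

# One column per topic word, one row per review.
# Lower-casing both sides and using fixed = TRUE gives a literal,
# case-insensitive substring test (no regex metacharacters involved).
match_matrix <- vapply(
  tolower(topic$word),
  function(w) grepl(w, tolower(review$Review), fixed = TRUE),
  logical(nrow(review))
)
# Force a matrix shape even when review has a single row.
match_matrix <- matrix(match_matrix, nrow = nrow(review))

# Row/column positions of every TRUE cell, i.e. every (review, word) match.
hits <- which(match_matrix, arr.ind = TRUE)

result <- data.frame(
  Review = review$Review[hits[, "row"]],
  Topic  = topic$Topic[hits[, "col"]],
  stringsAsFactors = FALSE
)
result
```

Like the filter(matched) step above, this keeps only reviews that matched at least one word, and a review that matches several words appears once per match.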
You can also use the fuzzyjoin package, which is meant for joins on fuzzy things (appropriately named).
fuzzyjoin::regex_left_join(review, topic, by = c(Review = "word"), ignore_case = TRUE)
# Warning: Coercing `pattern` to a plain character vector.
# Review word Topic
# 1 Sports and physical exercise need to be given importance sports and physical Sports
# 2 Canteen Food could be improved canteen food Canteen
The warning appears because your columns are factors, not character; it should be harmless. If you want to hide the warning, you can use suppressWarnings (a little strong); if you want to prevent it, convert all applicable columns from factor to character (e.g., topic[] <- lapply(topic, as.character), and the same for review$Review, though modify it if you have numeric columns, since that call converts every column).
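A minimal sketch of converting only the factor columns, so that any numeric columns are left alone. The column n below is hypothetical and only there to show that non-factor columns survive unchanged; is_fct is likewise just a name chosen for this example.

```r
topic <- data.frame(
  word  = factor(c("sports and physical", "canteen food")),
  Topic = factor(c("Sports ", "Canteen")),
  n     = c(1L, 2L)   # hypothetical numeric column, not in the original data
)

# Flag the factor columns, then convert just those to character.
is_fct <- vapply(topic, is.factor, logical(1))
topic[is_fct] <- lapply(topic[is_fct], as.character)

str(topic)
```

The same two lines can be applied to review in place of topic.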