Search code examples
rdplyrtext-miningtmtidytext

Mapping the topic of the review in R


I have two data sets, Review Data & Topic Data

Dput code of my Review Data

structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", 
"Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Dput code of my Topic Data

structure(list(word = structure(2:1, .Label = c("canteen food", 
"sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen", 
"Sports "), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Dput of my Desired Output, I want to look up the words which are appearing in Topic Data and map the same to the Review Data

structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", 
"Sports and physical exercise need to be given importance"), class = "factor"), 
    Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Solution

  • What you want is something like a fuzzy join. Here's a brute-force looking for strict substring (but case-insensitive):

    library(dplyr)
    review %>%
      full_join(topic, by = character()) %>% # full cartesian expansion
      group_by(word) %>%
      mutate(matched = grepl(word[1], Review, ignore.case = TRUE)) %>%
      ungroup() %>%
      filter(matched) %>%
      select(-word, -matched)
    # # A tibble: 2 x 2
    #   Review                                                   Topic    
    #   <fct>                                                    <fct>    
    # 1 Sports and physical exercise need to be given importance "Sports "
    # 2 Canteen Food could be improved                           "Canteen"
    

    It's a little brute-force in that it does a cartesian join of the frames before testing with grepl, but ... you can't really avoid some parts of that.

    You can also use the fuzzyjoin package, which is meant for joins on fuzzy things (appropriately named).

    fuzzyjoin::regex_left_join(review, topic, by = c(Review = "word"), ignore_case = TRUE)
    # Warning: Coercing `pattern` to a plain character vector.
    #                                                     Review                word   Topic
    # 1 Sports and physical exercise need to be given importance sports and physical Sports 
    # 2                           Canteen Food could be improved        canteen food Canteen
    

    The warning is because your columns are factors, not character, it should be harmless. If you want to hide the warning, you can use suppressWarnings (a little strong); if you want to prevent the warning, convert all applicable columns from factor to character (e.g., topic[] <- lapply(topic, as.character), same for review$Review, though modify it if you have numeric columns).