Approximate string matching in R between two datasets


I have the following dataset of film titles and their genres, and a second dataset of plain text in which these titles may or may not be mentioned:

dt1

   title                                        genre

   Secret in Their Eyes                         Dramas
   V for Vendetta                               Action & Adventure
   Bottersnikes & Gumbles                       Kids' TV
   ...                                          ...

and

dt2

id      Text
1       "I really liked V for Vendetta"
2       "Bottersnikes & Gumbles was a great film .... "
3       " In any case, in my opinion bottersnikes &gumbles was a great film ..."
4       "@thewitcher was an interesting series"
5       "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
... etc

What I want is a function that takes the titles in dt1 and looks for them in the text in dt2.

If it finds an exact or approximate match, I want a column in dt2 that gives the title mentioned in the text. If more than one title is mentioned, I want all of them, separated by commas.

dt2

id      Text                                                                       mentions
1       "I really liked V for Vendetta"                                            "V for Vendetta"
2       "Bottersnikes & Gumbles was a great film .... "                            "Bottersnikes & Gumbles"
3       " In any case, in my opinion bottersnikes &gumbles was a great film ..."   "Bottersnikes & Gumbles"
4       "@thewitcher was an interesting series"                                    NA
5       "Secret in Their Eye is a terrible film! but I Like V per Vendetta"        "Secret in Their Eyes, V for Vendetta"
... etc

Solution

  • You can do the fuzzy matching with agrepl(). Here I call it once per title with lapply(), which gives one logical vector per title (one element per Text). I then put those vectors into a data.frame and use apply() across its rows to paste together the titles that matched each text.

    You can tweak the max.distance value, but 0.3 worked fine on your example. Note that a text with no match gets an empty string rather than NA, and multiple titles are joined with "," without a space.

    dt1 <- data.frame(
      title = c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles"),
      genre = c("Dramas", "Action & Adventure", "Kids' TV"),
      stringsAsFactors = FALSE
    )
    
    dt2 <- data.frame(
      id = 1:5,
      Text = c(
        "I really liked V for Vendetta",
        "Bottersnikes & Gumbles was a great film .... ",
        "In any case, in my opinion bottersnikes &gumbles was a great film ...",
        "@thewitcher was an interesting series",
        "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
      ),
      stringsAsFactors = FALSE
    )
    
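    match_titles <- function(target, titles) {
      # One logical vector per title: does each text approximately contain it?
      matches <- lapply(titles, agrepl, target,
        max.distance = 0.3,
        ignore.case = TRUE, fixed = TRUE
      )
      # One row per text, one column per title; paste the titles that matched
      matched_titles <- apply(
        data.frame(matches), 1,
        function(y) paste(titles[y], collapse = ",")
      )
      matched_titles
    }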
    
    dt2$titles <- match_titles(dt2$Text, dt1$title)
    dt2
    ##   id                                                                  Text
    ## 1  1                                         I really liked V for Vendetta
    ## 2  2                         Bottersnikes & Gumbles was a great film .... 
    ## 3  3 In any case, in my opinion bottersnikes &gumbles was a great film ...
    ## 4  4                                 @thewitcher was an interesting series
    ## 5  5     Secret in Their Eye is a terrible film! but I Like V per Vendetta
    ##                                titles
    ## 1                      V for Vendetta
    ## 2              Bottersnikes & Gumbles
    ## 3              Bottersnikes & Gumbles
    ## 4                                    
    ## 5 Secret in Their Eyes,V for Vendetta
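
In the output above, row 4 comes back as an empty string and the titles in row 5 are joined without a space, while the question asked for NA and a comma-separated list. A minimal sketch of a variant that does this is below. It is only checked against the example data: match_titles_na, max_dist and sep are names made up here, and it loops over the texts with vapply() rather than building a data.frame. The last two lines show the effect of tightening max.distance, which agrepl() reads as a fraction of the pattern length when it is below 1.

```r
# Hypothetical variant of match_titles(): returns NA when nothing matches
# and exposes the distance threshold (max_dist) and separator (sep) as arguments.
match_titles_na <- function(texts, titles, max_dist = 0.3, sep = ", ") {
  vapply(texts, function(txt) {
    # One fuzzy test per title against this single text
    hit <- vapply(titles, function(ttl) {
      agrepl(ttl, txt, max.distance = max_dist, ignore.case = TRUE, fixed = TRUE)
    }, logical(1))
    # Join the matched titles, or return a character NA if none matched
    if (any(hit)) paste(titles[hit], collapse = sep) else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

titles <- c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles")
texts <- c(
  "I really liked V for Vendetta",
  "@thewitcher was an interesting series",
  "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
)

mentions <- match_titles_na(texts, titles)
print(mentions)

# "V per Vendetta" needs two edits to become "V for Vendetta".
# 0.05 of the 14-character pattern rounds up to 1 allowed edit, so it is rejected;
# 0.3 rounds up to 5 allowed edits, so it is accepted.
strict <- agrepl("V for Vendetta", "I Like V per Vendetta",
                 max.distance = 0.05, ignore.case = TRUE)
loose <- agrepl("V for Vendetta", "I Like V per Vendetta",
                max.distance = 0.3, ignore.case = TRUE)
```

With your own data you would call it as dt2$mentions <- match_titles_na(dt2$Text, dt1$title). A smaller max_dist reduces false positives, which probably matters most for short titles, at the cost of missing misspellings like "V per Vendetta".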