Tags: r, web-scraping, data-cleaning, data-processing, data-collection

Match keywords made up of two or more words within a sentence in R


I am trying to extract specific keywords from cricket commentary. Some of the keywords I am looking for are combinations of two or three words.

This is the list of keywords I am looking for in the commentary:

region <- c("third man", "deep fine leg", "long leg", "deep square leg", "Deep mid wicket",
            "cow corner", "long on", "Deep extra cover", "Deep Cover", "Deep point",
            "Deep backword point", "fly slip", "backword point", "point", "cover", "Extra covers",
            "mid off", "mid on", "mid wicket", "square leg", "backword square leg", "fine leg",
            "slips", "gully", "silly point", "silly mid off", "silly mid on", "short leg", 
            "leg gully", "leg slip")

*Pretorius to Umesh Yadav, 1 run, pitched up by Pretorius, touch slower as it has been driven along the ground to long-off

Pretorius to Chahar, SIX, that's a great shot. Pitched up by Pretorius outside off, a slower one and Chahar goes down on his knee and plays a fantastic lofted shot to clear the boundary at deep extra cover

Pretorius to Umesh Yadav, 1 run, touch fuller on off, Umesh Yadav drills it to long-off for a single*

How do I match keywords in the commentary when a keyword consists of two or more words? For each ball, I am expecting to get back which keyword(s) from the list above matched the commentary.

I am using R version 4.2.1 and RStudio.


Solution

  • It would be best to preprocess your sentences and keywords before matching (e.g. convert to lowercase, remove punctuation).

    For example, your sentence

    Pretorius to Chahar, SIX, that's a great shot. Pitched up by Pretorius outside off, a slower one and Chahar goes down on his knee and plays a fantastic lofted shot to clear the boundary at deep extra cover

    won't match anything in your region vector, because the corresponding value ("Deep extra cover") is not all lowercase while the commentary text is.

    I am not sure about your desired output, but to return the matches for each sentence I would do something like this using dplyr and stringr:

    library(stringr)
    library(dplyr)
    
    sentence <- data.frame(sens = c("Pretorius to Umesh Yadav, 1 run, pitched up by Pretorius, touch slower as it has been driven along the ground to long-off",
                                    "Pretorius to Chahar, SIX, that's a great shot. Pitched up by Pretorius outside off, a slower one and Chahar goes down on his knee and plays a fantastic lofted shot to clear the boundary at deep extra cover"))
    
    region <- c("third man", "deep fine leg", "long leg", "deep square leg", "Deep mid wicket",
                "cow corner", "long on", "Deep extra cover", "Deep Cover", "Deep point",
                "Deep backword point", "fly slip", "backword point", "point", "cover", "Extra covers",
                "mid off", "mid on", "mid wicket", "square leg", "backword square leg", "fine leg",
                "slips", "gully", "silly point", "silly mid off", "silly mid on", "short leg", 
                "leg gully", "leg slip")
    
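    # Build a single lowercased alternation pattern from all keywords
    pattern <- paste0(tolower(region), collapse = "|")
    
    # For each sentence, collect every keyword found, separated by "|"
    sentence %>%
      rowwise() %>%
      mutate(match = paste0(str_extract_all(tolower(sens), pattern, simplify = TRUE),
                            collapse = "|"))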
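
One caveat with a plain alternation is that short keywords such as "cover" or "point" can also match inside longer words, and hyphenated text such as "long-off" never matches a space-separated keyword. The following is a minimal sketch of one way to tighten this, not a definitive implementation. It assumes that punctuation can be replaced by spaces, and that "long off" is added to the keyword list (it is not in the original `region` vector). The helper `normalise()` and the shortened `region` and `commentary` vectors are hypothetical and only for illustration.

```r
library(stringr)

# Hypothetical helper: lowercase, turn every run of non-alphanumeric
# characters (including hyphens) into one space, then trim.
normalise <- function(x) {
  x <- tolower(x)
  x <- str_replace_all(x, "[^a-z0-9]+", " ")
  str_squish(x)
}

# Shortened keyword list for illustration; "long off" is an assumed addition.
region <- c("third man", "Deep extra cover", "cover", "point",
            "silly point", "long off", "long on")

commentary <- c(
  "Pretorius to Umesh Yadav, 1 run, driven along the ground to long-off",
  "Chahar plays a fantastic lofted shot to clear the boundary at deep extra cover"
)

# Put longer keywords first so that "deep extra cover" is preferred over "cover"
# whenever both could start at the same position.
keys <- unique(normalise(region))
keys <- keys[order(nchar(keys), decreasing = TRUE)]

# Word boundaries stop "cover" from matching inside e.g. "discovered".
pattern <- paste0("\\b(?:", paste(keys, collapse = "|"), ")\\b")

# One list element per sentence, collapsed to a single "|"-separated string.
matches <- str_extract_all(normalise(commentary), pattern)
matched <- vapply(matches, paste, character(1), collapse = "|")

print(data.frame(commentary = commentary, match = matched))
```

With these assumptions the first sentence yields "long off" and the second "deep extra cover". Sentences without any keyword get an empty string, which makes them easy to filter out afterwards.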