
Count number of exactly matching words in a string


I have a tibble with an id column and a text_entry column that captures the text people entered.
Goal: Compare each person's text_entry to a key and count the number of words typed exactly as they appear in the key.
For example, if my inputs were:

library(tibble)  # provides tribble()

df <- tribble(~id, ~text_entry,
              1, "It was a Saturday night in December.",
              2, " It was a Saturday night",
              3, "It wuz a Sturday nite in",
              4, "IT WAS A SATURDAY",
              5, "was a Saturday"); df

key <- "It was a Saturday night in December."

Then I would want the following:

df2 <- tribble(~id, ~text_entry, ~words_correct, 
               1, "It was a Saturday night in December.", 7, # whole string perfect
               2, " It was a Saturday night", 5,             # first 5 words perfect
               3, "It wuz a Sturday nite in", 3,             # misspelled "was", "Saturday" and "night"
               4, "IT WAS A SATURDAY", 0,                    # case-sensitive
               5, "was a Saturday", 3); df2                  # ok to start several words into the key

I'm completely striking out with stringr/stringi solutions. tidyverse always preferred, but I'm desperate for any solution.

Thanks so, SO much for your help & insights in advance!


Solution

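  • One way is to split each string on whitespace and count how many of the resulting words also appear in key. Note that %in% only tests membership, so word order and repeated words are not checked.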

    library(tidyverse)
    
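    # Split the key into words once, up front
    keywords <- strsplit(key, '\\s+')[[1]]
    
    # For each entry, split on whitespace and count the words found in the key
    df %>%
      mutate(words_correct = map_dbl(str_split(text_entry, '\\s+'),
                                     ~sum(.x %in% keywords)))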
    
    # A tibble: 5 x 3
    #     id text_entry                             words_correct
    #  <dbl> <chr>                                          <dbl>
    #1     1 "It was a Saturday night in December."             7
    #2     2 " It was a Saturday night"                         5
    #3     3 "It wuz a Sturday nite in"                         3
    #4     4 "IT WAS A SATURDAY"                                0
    #5     5 "was a Saturday"                                   3
    

    We can also do this in base R:

    df$words_correct <- sapply(strsplit(df$text_entry, '\\s+'), 
                               function(x) sum(x %in% keywords))
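
To see what the split-and-match step does for a single entry, here is a minimal base R sketch. The helper name count_exact_words is hypothetical (it is not part of the answer above); it wraps the same strsplit() and %in% logic so that one row can be inspected on its own.

```r
# Hypothetical helper: count words in `entry` that appear verbatim in `key`.
# Same logic as the answer above, applied to a single string.
count_exact_words <- function(entry, key) {
  keywords <- strsplit(key, "\\s+")[[1]]    # words of the key
  words    <- strsplit(entry, "\\s+")[[1]]  # words of the typed entry
  sum(words %in% keywords)                  # TRUE counts as 1, FALSE as 0
}

key <- "It was a Saturday night in December."

# Row 3 from the question: only "It", "a" and "in" match exactly.
row3 <- count_exact_words("It wuz a Sturday nite in", key)

# A leading space yields an empty first token, which is not in the key,
# so it does not affect the count (row 2 from the question).
row2 <- count_exact_words(" It was a Saturday night", key)

# Caveat: membership ignores position and repetition,
# so a word repeated three times is counted three times.
repeated <- count_exact_words("was was was", key)

print(c(row3 = row3, row2 = row2, repeated = repeated))
```

If repeated or out-of-order words should not count, the membership test would have to be replaced by a position-aware comparison; that is beyond what the question's expected output requires.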