r, nlp

Count number of English words in string in R


I would like to count the number of English words in a string of text.

df.words <- data.frame(ID = 1:2,
                       text = c("frog friend fresh frink foot",
                                "get give gint gobble"))

df.words

  ID                         text
1  1 frog friend fresh frink foot
2  2         get give gint gobble

I'd like the final product to look like this:

  ID                         text count
1  1 frog friend fresh frink foot     4
2  2         get give gint gobble     3

I'm guessing I'll have to first separate based on spaces and then reference the words against a dictionary?


Solution

  • Building on @r2evans's suggestion to use strsplit(), and using an English word list (a .txt file available online) as the dictionary, here is an example. This solution may not scale well to a large number of comparisons because the unnest step creates one row per word.

    library(dplyr)
    library(tidyr)
    
    # text file with 479k English words ~4MB
    dict <- read.table(file = url("https://github.com/dwyl/english-words/raw/master/words_alpha.txt"),
                       col.names = "text2",
                       stringsAsFactors = FALSE)  # keep words as character so the join key types match
    
    df.words <- data.frame(ID = 1:2,
                           text = c("frog friend fresh frink foot",
                                    "get give gint gobble"),
                           stringsAsFactors = FALSE)
    
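    df.words %>% 
      # split each string on runs of whitespace into a list-column of words
      mutate(text2 = strsplit(text, split = "\\s+")) %>% 
      # expand to one row per word
      unnest(text2) %>% 
      # keep only words found in the dictionary
      # (a row with no dictionary words at all is dropped here)
      semi_join(dict, by = "text2") %>% 
      group_by(ID, text) %>% 
      # count the remaining words per original row
      summarise(count = n())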
    

    Output

         ID text                         count
      <int> <chr>                        <int>
    1     1 frog friend fresh frink foot     4
    2     2 get give gint gobble             3
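
If the unnest step becomes a bottleneck, or if rows with no English words should appear with a count of 0 rather than being dropped, a base R variant may be worth trying. The following is only a sketch under some assumptions: dict_words is a small hypothetical stand-in for the full word list (in practice it would be the text2 column of dict), the third example row is added purely for illustration, and the text is lower-cased before matching because the word list appears to be lower-case.

```r
# Hypothetical stand-in dictionary; in practice use dict$text2 from the full word list.
dict_words <- c("frog", "friend", "fresh", "foot", "get", "give", "gobble")

# Example data; the third row is an added illustration of a string with no matches.
df.words <- data.frame(ID = 1:3,
                       text = c("frog friend fresh frink foot",
                                "get give gint gobble",
                                "zzz qqq"),
                       stringsAsFactors = FALSE)

# Split each string on runs of whitespace; the result is a list of character vectors.
tokens <- strsplit(tolower(df.words$text), "\\s+")

# For each row, count how many of its tokens appear in the dictionary.
df.words$count <- vapply(tokens, function(w) sum(w %in% dict_words), integer(1))

df.words
```

Because the count is computed per row with vapply() and %in%, no long intermediate table is built, and every input row is kept in the result.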