Tags: r, dataframe, text, word-frequency

Adding dataframe column with frequency counts for several pre-specified words in R


I have a dataframe of thousands of news articles that looks like this:

id  text                                                                        date
1   newyorktimes leaders gather for the un summit in next week to discuss       1980-1-18
2   newyorktimes opinion section what the washingtonpost got wrong about        1980-1-22
3   a journalist for the washingtonpost went missing while on assignment        1980-1-22
4   washingtonpost president carter responds to criticisms on economic decline  1980-1-28
5   newyorktimes opinion section what needs to be down with about the rats      1980-1-29

I want to add a column that holds the combined count of several specific words in each article. Say I want to know how many times "newyorktimes", "washingtonpost", and "the" appear in each article. The new column should contain the total number of occurrences of those words in that row's text, like this:

id  text                                                                        date       wordlistcount
1   newyorktimes leaders gather for the un summit in next week to discuss       1980-1-18  2
2   newyorktimes opinion section what the washingtonpost got wrong about        1980-1-22  3
3   a journalist for the washingtonpost went missing while on assignment        1980-1-22  2
4   washingtonpost president carter responds to criticisms on economic decline  1980-1-28  1
5   newyorktimes opinion section what needs to be down with about the rats      1980-1-29  2

How can I accomplish this? Any help would be greatly appreciated.


Solution

  • In stringr, with str_count:

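    library(stringr)
    library(dplyr)

    words <- c("newyorktimes", "washingtonpost", "the")

    # Build one regex alternation, wrapping each word in \\b (word boundary)
    # so that "the" is not counted inside longer words such as "gather".
    pattern <- str_c("\\b", words, "\\b", collapse = "|")

    # str_count() returns the number of pattern matches in each row's text
    df %>%
      mutate(wordlistcount = str_count(text, pattern))
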
    #   id                                                                       text      date wordlistcount
    # 1  1      newyorktimes leaders gather for the un summit in next week to discuss 1980-1-18             2
    # 2  2       newyorktimes opinion section what the washingtonpost got wrong about 1980-1-22             3
    # 3  3       a journalist for the washingtonpost went missing while on assignment 1980-1-22             2
    # 4  4 washingtonpost president carter responds to criticisms on economic decline 1980-1-28             1
    # 5  5     newyorktimes opinion section what needs to be down with about the rats 1980-1-29             2
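
If you would rather not load stringr and dplyr, the same count can be sketched in base R. This is only an illustration, not part of the original answer: it rebuilds the sample rows from the question as `df` (an assumption about how the data is stored, with `text` as a character column) and counts matches with `gregexpr()`.

```r
# Sample data copied from the question (assumed to be a plain data.frame)
df <- data.frame(
  id = 1:5,
  text = c(
    "newyorktimes leaders gather for the un summit in next week to discuss",
    "newyorktimes opinion section what the washingtonpost got wrong about",
    "a journalist for the washingtonpost went missing while on assignment",
    "washingtonpost president carter responds to criticisms on economic decline",
    "newyorktimes opinion section what needs to be down with about the rats"
  ),
  date = c("1980-1-18", "1980-1-22", "1980-1-22", "1980-1-28", "1980-1-29"),
  stringsAsFactors = FALSE
)

words <- c("newyorktimes", "washingtonpost", "the")

# Same alternation pattern as in the stringr version, built with paste0()
pattern <- paste0("\\b", words, "\\b", collapse = "|")

# gregexpr() returns the match positions for each string, or -1 if none,
# so counting the positive positions gives the number of matches per row
matches <- gregexpr(pattern, df$text, perl = TRUE)
df$wordlistcount <- vapply(matches, function(m) sum(m > 0), integer(1))

df$wordlistcount
```

With thousands of articles either version should be fine, since the pattern is compiled once and applied to the whole column in a single vectorised call.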