I have a dataframe of thousands of news articles that looks like this:
id | text | date |
---|---|---|
1 | newyorktimes leaders gather for the un summit in next week to discuss | 1980-1-18 |
2 | newyorktimes opinion section what the washingtonpost got wrong about | 1980-1-22 |
3 | a journalist for the washingtonpost went missing while on assignment | 1980-1-22 |
4 | washingtonpost president carter responds to criticisms on economic decline | 1980-1-28 |
5 | newyorktimes opinion section what needs to be down with about the rats | 1980-1-29 |
I want to add a column holding the combined count of several specific words in each article. Say I want to know how many times "newyorktimes", "washingtonpost", and "the" appear in each article: the new column should contain the total number of occurrences of those words in that row's text. Like this:
id | text | date | wordlistcount |
---|---|---|---|
1 | newyorktimes leaders gather for the un summit in next week to discuss | 1980-1-18 | 2 |
2 | newyorktimes opinion section what the washingtonpost got wrong about | 1980-1-22 | 3 |
3 | a journalist for the washingtonpost went missing while on assignment | 1980-1-22 | 2 |
4 | washingtonpost president carter responds to criticisms on economic decline | 1980-1-28 | 1 |
5 | newyorktimes opinion section what needs to be done with about the rats | 1980-1-29 | 2 |
How can I accomplish this? Any help would be greatly appreciated.
In stringr, with str_count:
library(stringr)
library(dplyr)
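# Words whose occurrences should be counted
words <- c("newyorktimes", "washingtonpost", "the")
# Join the words into one regex alternation, each wrapped in word boundaries,
# then count the matches in every row of `text`
df %>%
  mutate(wordlistcount = str_count(text, str_c("\\b", words, "\\b", collapse = "|")))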
# id text date wordlistcount
# 1 1 newyorktimes leaders gather for the un summit in next week to discuss 1980-1-18 2
# 2 2 newyorktimes opinion section what the washingtonpost got wrong about 1980-1-22 3
# 3 3 a journalist for the washingtonpost went missing while on assignment 1980-1-22 2
# 4 4 washingtonpost president carter responds to criticisms on economic decline 1980-1-28 1
# 5 5 newyorktimes opinion section what needs to be down with about the rats 1980-1-29 2
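To make the approach easier to try out, here is a minimal self-contained sketch. It assumes the data frame is called `df` with a character column `text`, rebuilds only the first three rows from the question, and stores the regex in a variable named `pattern` (that name is my own, not part of the original answer). It also shows why the `\\b` word boundaries matter: without them, "the" would also be counted inside words such as "gather".

```r
library(stringr)
library(dplyr)

# First three rows of the sample data from the question (assumed layout)
df <- data.frame(
  id = 1:3,
  text = c(
    "newyorktimes leaders gather for the un summit in next week to discuss",
    "newyorktimes opinion section what the washingtonpost got wrong about",
    "a journalist for the washingtonpost went missing while on assignment"
  ),
  date = c("1980-1-18", "1980-1-22", "1980-1-22"),
  stringsAsFactors = FALSE
)

words <- c("newyorktimes", "washingtonpost", "the")

# str_c() wraps each word in \b ... \b, then collapse = "|" joins the pieces
# into a single alternation:
#   \bnewyorktimes\b|\bwashingtonpost\b|\bthe\b
pattern <- str_c("\\b", words, "\\b", collapse = "|")

# str_count() is vectorised over `text`, so it returns one count per row
result <- df %>%
  mutate(wordlistcount = str_count(text, pattern))

# Effect of the word boundaries on a single string
no_boundary   <- str_count("gather the", "the")        # also matches inside "gather"
with_boundary <- str_count("gather the", "\\bthe\\b")  # matches the whole word only
```

Note that matching is case-sensitive by default, so this sketch assumes the text is already lower-cased, as it is in the sample data.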