In my corpus of news articles, I would like to replace several different n-grams that refer to the same political party with a single acronym. I want to do this so that sentiment dictionaries do not confuse a word in the party's name (Liberal Party) with the same word in a different context (liberal helping).
I can do this with str_replace_all() as shown below, and I know about the tokens_compound() function in quanteda, but it doesn't seem to do exactly what I need.
library(stringr)
text<-c('a text about some political parties called the new democratic party the new democrats and the liberal party and the liberals')
text1<-str_replace_all(text, '(liberal party)|liberals', 'olp')
text2<-str_replace_all(text1, '(new democrats)|new democratic party', 'ndp')
Should I somehow just preprocess the text before turning it into a corpus? Or is there a way to do this after turning it into a corpus in quanteda?
Here is some expanded sample code that specifies the problem a little better:
`text<-c('a text about some political parties called the new democratic party
the new democrats and the liberal party and the liberals. I would like the
word democratic to be counted in the dfm but not the words new democratic.
The same goes for liberal helpings but not liberal party')
partydict <- dictionary(list(
olp = c("liberal party", "liberals"),
ndp = c("new democrats", "new democratic party"),
sentiment=c('liberal', 'democratic')
))
dfm(text, dictionary=partydict)`
This example counts democratic in both the "new democratic" and the plain "democratic" sense, but I would like those counted separately.
You want the function tokens_lookup(), applied with a dictionary whose keys are the canonical party labels and whose values list all the n-gram variations of each party name. Setting exclusive = FALSE keeps the tokens that are not matched, so the lookup in effect substitutes the canonical party label for every variation.
In the example below, I've modified your input text a bit to show that the party names get replaced while phrases that use "liberal" without "party" (such as "liberal helping") are left alone.
library("quanteda")
text<-c('a text about some political parties called the new democratic party
which is conservative the new democrats and the liberal party and the
liberals which are liberal helping poor people')
toks <- tokens(text)
partydict <- dictionary(list(
olp = c("liberal party", "the liberals"),
ndp = c("new democrats", "new democratic party")
))
(toks2 <- tokens_lookup(toks, partydict, exclusive = FALSE))
## tokens from 1 document.
## text1 :
## [1] "a" "text" "about" "some" "political" "parties"
## [7] "called" "the" "NDP" "which" "is" "conservative"
## [13] "the" "NDP" "and" "the" "OLP" "and"
## [19] "OLP" "which" "are" "liberal" "helping" "poor"
## [25] "people"
So that has replaced the party name variants with the party keys. A dfm constructed from these new tokens preserves the uses of (e.g.) "liberal" that might be linked to sentiment, because "liberal party" has already been combined and replaced with "OLP". Applying a dictionary to the dfm will now work for your example of "liberal" in "liberal helping" without confusing it with the use of "liberal" in the party name.
sentdict <- dictionary(list(
left = c("liberal", "left"),
right = c("conservative", "")
))
# build the dfm from the substituted tokens, then apply the sentiment dictionary
dfm_lookup(dfm(toks2), dictionary = sentdict, exclusive = FALSE)
## Document-feature matrix of: 1 document, 19 features (0% sparse).
## 1 x 19 sparse Matrix of class "dfm"
## features
## docs olp ndp a text about some political parties called the which is RIGHT and LEFT are helping
## text1 2 2 1 1 1 1 1 1 1 3 2 1 1 2 1 1 1
## features
## docs poor people
## text1 1 1
Two additional notes:
If you do not want the keys uppercased in the replacement tokens, set capkeys = FALSE.
You can set different matching types using the valuetype argument, including valuetype = "regex". (And note that the parentheses in your regular expressions are not doing any work: the | operator has the lowest precedence, so '(new democrats)|new democratic party' already means "new democrats" OR "new democratic party". But with tokens_lookup() you won't need to worry about that!)
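To make the two notes concrete, here is a minimal sketch of both options. The text and the object names (txt, toks_lower, partydict_re, toks_re) are made up for illustration, and the regular expressions are just one way to write the patterns. It assumes that, as with the default glob matching, each space-separated element of a dictionary value is matched against one token.

```r
library("quanteda")

# Hypothetical sample text, shortened from the example above
txt <- "the new democratic party and the liberals like liberal helpings"
toks <- tokens(txt)

# Note 1: default (glob) matching, but keep the keys in lower case
partydict <- dictionary(list(
    olp = c("liberal party", "the liberals"),
    ndp = c("new democrats", "new democratic party")
))
toks_lower <- tokens_lookup(toks, partydict, exclusive = FALSE,
                            capkeys = FALSE)

# Note 2: regular-expression matching. Each space-separated element is
# matched against a single token, so the ^ and $ anchors stop "liberal"
# from also matching inside "liberals" or "liberalism".
partydict_re <- dictionary(list(
    olp = c("^liberal$ ^party$", "^the$ ^liberals$"),
    ndp = c("^new$ ^democrats$", "^new$ ^democratic$ ^party$")
))
toks_re <- tokens_lookup(toks, partydict_re, valuetype = "regex",
                         exclusive = FALSE)

print(toks_lower)
print(toks_re)
```

In both cases the standalone "liberal" in "liberal helpings" is left as an ordinary token, so a sentiment dictionary applied afterwards can still count it.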