I am using the quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features and decided to use the textstat_collocations() function. According to the documentation, and I quote:
"The tokens object . . . . While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized."
This makes perfect sense, so here goes:
library(dplyr)
library(tibble)
library(quanteda)
library(quanteda.textstats)
# Some sample data and lemmas
df= c("this column has a lot of missing data, 50% almost!",
"I am interested in missing data problems",
"missing data is a headache",
"how do you handle missing data?")
lemmas <- data.frame(
  inflected_form = c("missing", "data"),
  lemma = c("miss", "datum"),
  stringsAsFactors = FALSE
)
(1) Generate collocations using the corpus object:
txtCorpus = corpus(df)
docvars(txtCorpus)$text <- as.character(txtCorpus)
myPhrases = textstat_collocations(txtCorpus, tolower = FALSE)
(2) Preprocess the text, compound the collocations, and lemmatize for downstream tasks:
# I used a blank space as the concatenator and the phrase() function, following the multi-multi substitution example in the documentation:
# https://quanteda.io/reference/tokens_replace.html
txtTokens = tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_compound(pattern = phrase(myPhrases$collocation), concatenator = " ") %>%
tokens_replace(pattern=phrase(c(lemmas$inflected_form)), replacement=phrase(c(lemmas$lemma)))
(3) Test the results:
# Create dtm
dtm = dfm(txtTokens, remove_padding = TRUE)
# pull features
dfm_feat = as.data.frame(featfreq(dtm)) %>%
rownames_to_column(var="feature") %>%
`colnames<-`(c("feature", "count"))
dfm_feat
feature | count |
---|---|
this | 1 |
column | 1 |
has | 1 |
a | 2 |
lot | 1 |
of | 1 |
almost | 1 |
i | 2 |
am | 1 |
interested | 1 |
in | 1 |
problems | 1 |
is | 1 |
headache | 1 |
how | 1 |
do | 1 |
you | 1 |
handle | 1 |
missing data | 4 |
"missing data" should be "miss datum".
This only works if each document in df is a single word. I can make the process work if I generate my collocations from a tokens object from the get-go, but that's not what I want.
The problem is that you have already compounded the elements of the collocations into a single "token" containing a space, but by wrapping the patterns in phrase(), you are telling tokens_replace() to match sequences of separate tokens, not the one token that contains a space.
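The mismatch can be reproduced on a toy tokens object. This is a minimal sketch, assuming quanteda is installed; the object toks and its contents are made up for illustration and are not part of the original data.

```r
library(quanteda)

# Toy tokens object (hypothetical) holding one already-compounded token
# that contains a space, as tokens_compound(..., concatenator = " ") produces
toks <- as.tokens(list(d1 = c("handle", "missing data")))

# phrase() splits the pattern on whitespace, so this searches for the token
# "missing" followed by the token "data"; neither exists as a separate token
with_phrase <- tokens_replace(toks, phrase("missing data"), phrase("miss datum"))

# Without phrase(), the pattern is compared against each whole token
without_phrase <- tokens_replace(toks, "missing data", "miss datum")

as.list(with_phrase)$d1
as.list(without_phrase)$d1
```

Only the second call should change the compounded token, which is the behaviour the fix relies on.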
The way to get what you want is by making the lemmatised replacement match the collocation.
phrase_lemmas <- data.frame(
inflected_form = "missing data",
lemma = "miss datum"
)
tokens_replace(txtTokens, phrase_lemmas$inflected_form, phrase_lemmas$lemma)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "almost"
##
## text2 :
## [1] "i" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
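With a longer collocation list, the phrase-level table could probably be derived from the word-level lemmas instead of being typed by hand. The following is a sketch in base R under some assumptions: lemmatise_phrase() is a hypothetical helper, not a quanteda function, and the collocations vector stands in for myPhrases$collocation.

```r
# Word-level lemma table, as in the question
lemmas <- data.frame(
  inflected_form = c("missing", "data"),
  lemma = c("miss", "datum"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lemmatise each word of a space-separated collocation,
# leaving words without an entry in the lemma table unchanged
lemmatise_phrase <- function(phrases, lemmas) {
  lookup <- setNames(lemmas$lemma, lemmas$inflected_form)
  vapply(strsplit(phrases, " ", fixed = TRUE), function(words) {
    hit <- words %in% names(lookup)
    words[hit] <- lookup[words[hit]]
    paste(words, collapse = " ")
  }, character(1))
}

# Stand-in for myPhrases$collocation (assumed values, for illustration only)
collocations <- c("missing data", "a lot")

phrase_lemmas <- data.frame(
  inflected_form = collocations,
  lemma = lemmatise_phrase(collocations, lemmas),
  stringsAsFactors = FALSE
)
phrase_lemmas
```

The resulting columns can then be passed to tokens_replace() without phrase(), in the same way as the hand-written table.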
An alternative would be to use tokens_lookup() on uncompounded tokens directly, if you have a fixed list of sequences you want to match to lemmatised sequences. E.g.,
tokens(txtCorpus) %>%
tokens_lookup(dictionary(list("miss datum" = "missing data")),
exclusive = FALSE, capkeys = FALSE
)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "," "50" "%"
## [11] "almost" "!"
##
## text2 :
## [1] "I" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
## [6] "?"