I have a corpus with two languages (the language information is saved in the docvar lang) and want to remove stopwords depending on the docvar value.
I am using a substantively nonsensical example to illustrate the point (since in the example corpus, all speeches are in English):
library(quanteda)
library(quanteda.corpora)
corp <- corpus(data_corpus_ungd2017) %>%
  corpus_subset(country %in% c("Austria", "Australia"))
corp now contains the speeches by Austria and by Australia; let's pretend they were given in German and in English. How do I selectively remove stopwords based on the value of, say, country_iso? I tried something like this, unsuccessfully:
toks <- corp %>%
  tokens() %>%
  ifelse(docvars(field = "country_iso") == "AUT",
         tokens_remove(stopwords("de")),
         tokens_remove(stopwords("en")))
Error in ifelse(., docvars(field = "country_iso") == "AUT", tokens_remove(stopwords("de")), : unused argument (tokens_remove(stopwords("en")))
How can I best achieve this in one corpus?
This is an interesting problem! Currently in quanteda (<= v1.5.1), the list-like tokens object does not allow assignment of document-level tokens elements, so lapply()-based solutions cannot be used. The most efficient way is to segment the tokens object into same-language chunks and then apply stopword removal to each chunk.
The UNGD (UN General Debate) corpus chosen as an example in your question is not ideally suited to this problem because it is entirely in English, so I've created a dual-language example below to illustrate the solution.
library("quanteda")
## Package version: 1.5.1
txt <- c(
  Austria = "Dies ist ein Beispieltext in deutscher Sprache.",
  Australia = "This is a sample English text.",
  Germany = "Dies ist ein Beispieltext in deutscher Sprache.",
  "United Kingdom" = "This is s sample English text."
)
corp <- corpus(txt,
  docvars = data.frame(country = names(txt), stringsAsFactors = FALSE)
)
Now we need to create and set a language docvar. The language assignment function below could be expanded, or you could create a separate table mapping countries to languages and create the language docvar by merging on it. The exact method is not central to the solution, but you do need a language variable that matches the ISO 639-1 language codes that stopwords::stopwords() takes as input.
# language assignment function
setlang <- Vectorize(
  vectorize.args = "country",
  FUN = function(country) {
    switch(country,
      "Austria" = "de",
      "Germany" = "de",
      "Australia" = "en",
      "United States" = "en",
      "United Kingdom" = "en"
    )
  }
)
# set a language docvar
docvars(corp, "lang") <- setlang(docvars(corp, "country"))
# inspect
summary(corp)
## Corpus consisting of 4 documents:
##
## Text Types Tokens Sentences country lang
## Austria 8 8 1 Austria de
## Australia 7 7 1 Australia en
## Germany 8 8 1 Germany de
## United Kingdom 7 7 1 United Kingdom en
##
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpflhaMi/reprex933852cb0380/* on x86_64 by kbenoit
## Created: Sat Nov 2 10:19:59 2019
## Notes:
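As an alternative to the switch()-based function above, the separate country-to-language table mentioned earlier could be sketched roughly as follows. The table name lang_table and its contents are made up for illustration, and the country vector stands in for what docvars(corp, "country") would return.

```r
# Hypothetical lookup table mapping countries to ISO 639-1 language codes
lang_table <- data.frame(
  country = c("Austria", "Germany", "Australia", "United States", "United Kingdom"),
  lang = c("de", "de", "en", "en", "en"),
  stringsAsFactors = FALSE
)

# Stand-in for docvars(corp, "country")
country <- c("Austria", "Australia", "Germany", "United Kingdom")

# match() gives the row of lang_table for each country (NA if not listed),
# so the result stays aligned with the document order
lang <- lang_table$lang[match(country, lang_table$country)]
lang

# With the corpus above, this could then be assigned as:
# docvars(corp, "lang") <- lang
```

A country missing from the table ends up as NA here, which makes gaps in the mapping easy to spot before removing stopwords.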
Now we can loop over the languages in the tokens object and remove stopwords for each language segment. The if below is needed to check for NULL on the first language, because (currently) c(NULL, tokensobject) does not return a tokens object, unlike c() as implemented for other objects.
toks <- tokens(corp)
tokslist <- NULL
for (l in unique(docvars(toks, "lang"))) {
  toksthislang <- tokens_subset(toks, lang == l) %>%
    tokens_remove(stopwords(language = l), padding = TRUE)
  tokslist <- if (!is.null(tokslist)) c(tokslist, toksthislang) else toksthislang
}
Finally, we might want to put the documents back into their original order. When we inspect the result, we can see that the language-appropriate stopwords have been removed. The "pads" have been left in for this example just so we can see this, but you probably don't want to keep them, so just set padding = FALSE (the default) in the tokens_remove() call above.
# put back into original order
toks <- tokslist[docnames(toks)]
lapply(toks, head)
## $Austria
## [1] "" "" "" "Beispieltext"
## [5] "" "deutscher"
##
## $Australia
## [1] "" "" "" "sample" "English" "text"
##
## $Germany
## [1] "" "" "" "Beispieltext"
## [5] "" "deutscher"
##
## $`United Kingdom`
## [1] "" "" "s" "sample" "English" "text"
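The same segment-and-recombine idea can also be written with lapply() over the languages (not over individual documents), which avoids the NULL check in the loop. This is only a sketch: it assumes that c() for tokens objects accepts several objects passed through do.call(), and the names toks_by_lang and toks_clean are made up for illustration. It builds its own small corpus so that it runs on its own.

```r
library("quanteda")

txt <- c(
  Austria = "Dies ist ein Beispieltext in deutscher Sprache.",
  Australia = "This is a sample English text.",
  Germany = "Dies ist ein Beispieltext in deutscher Sprache."
)
corp <- corpus(txt,
  docvars = data.frame(lang = c("de", "en", "de"), stringsAsFactors = FALSE)
)
toks <- tokens(corp)

# one tokens object per language, each with its own stopwords removed
toks_by_lang <- lapply(unique(docvars(toks, "lang")), function(l) {
  tokens_remove(tokens_subset(toks, lang == l), stopwords(language = l))
})

# recombine the chunks, then restore the original document order
toks_clean <- do.call(c, toks_by_lang)[docnames(toks)]
as.list(toks_clean)
```

Because padding is left at its default here, the removed stopwords leave no empty strings behind.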