I have data like this (simplified):
library(quanteda)
# sample data
myText <- c("ala ma kotka", "kasia ma pieska")
myDF <- data.frame(myText)
myDF$myText <- as.character(myDF$myText)
# tokenization
tokens <- tokens(myDF$myText, what = "word",
                 remove_numbers = TRUE, remove_punct = TRUE,
                 remove_symbols = TRUE, remove_hyphens = TRUE)
# stemming with my own sample dictionary
Origin <- c("kot", "pies")
Word <- c("kotek","piesek")
myDict <- data.frame(Origin, Word)
myDict$Origin <- as.character(myDict$Origin)
myDict$Word <- as.character(myDict$Word)
# what I got
tokens[1]
[1] "Ala" "ma" "kotka"
# what I would like to get
tokens[1]
[1] "Ala" "ma" "kot"
tokens[2]
[1] "Kasia" "ma" "pies"
A similar question has been answered here, but since that question's title (and accepted answer) do not make the link obvious, I will show you how this applies to your question specifically. I'll also provide additional detail below on implementing your own basic stemmer using wildcards for the suffixes.
The simplest way to do this is by using a custom dictionary where the keys are your stems and the values are the inflected forms. You can then use tokens_lookup() with the exclusive = FALSE, capkeys = FALSE options to convert the inflected terms into their stems.
Note that I have modified your example a little to simplify it, and to correct what I think were mistakes.
library("quanteda")
packageVersion("quanteda")
[1] ‘0.99.9’
# no need for the data.frame() call
myText <- c("ala ma kotka", "kasia ma pieska")
toks <- tokens(myText,
               remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE, remove_hyphens = TRUE)
Origin <- c("kot", "kot", "pies", "pies")
Word <- c("kotek", "kotka", "piesek", "pieska")
Then we create the dictionary, as follows. As of quanteda v0.99.9, values with the same keys are merged, so you can have a list mapping multiple, different inflected forms to the same key. Here, I had to add new values, since the inflected forms in your original Word vector were not found in the myText example.
temp_list <- as.list(Word)
names(temp_list) <- Origin
(stem_dict <- dictionary(temp_list))
## Dictionary object with 2 key entries.
## - [kot]:
##   - kotek, kotka
## - [pies]:
##   - piesek, pieska
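As an aside (this variant, and the name stem_dict_alt, are mine, not part of the original code): base R's split() builds the same named list in a single step, grouping the Word values by their Origin:

(stem_dict_alt <- dictionary(split(Word, Origin)))
## Dictionary object with 2 key entries.
## - [kot]:
##   - kotek, kotka
## - [pies]:
##   - piesek, pieska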
Then tokens_lookup() does its magic.
tokens_lookup(toks, dictionary = stem_dict, exclusive = FALSE, capkeys = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "ala" "ma" "kot"
##
## text2 :
## [1] "kasia" "ma" "pies"
An alternative is to implement your own stemmer using "glob" wildcarding to represent all suffixes for your Origin vector, which (here, at least) produces the same results:
temp_list <- lapply(unique(Origin), paste0, "*")
names(temp_list) <- unique(Origin)
(stem_dict2 <- dictionary(temp_list))
## Dictionary object with 2 key entries.
## - [kot]:
##   - kot*
## - [pies]:
##   - pies*
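One caveat with the wildcard version (my illustration, not part of the original answer): "kot*" matches any token beginning with "kot", so unrelated words are collapsed to the stem as well. For instance, "kotlet" ("cutlet") would also become "kot":

tokens_lookup(tokens("ala ma kotlet"), dictionary = stem_dict2,
              exclusive = FALSE, capkeys = FALSE)
## tokens from 1 document.
## text1 :
## [1] "ala" "ma" "kot"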
tokens_lookup(toks, dictionary = stem_dict2, exclusive = FALSE, capkeys = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "ala" "ma" "kot"
##
## text2 :
## [1] "kasia" "ma" "pies"