Tags: r, nlp, text-mining, stemming, quanteda

How to replace tokens (words) with stemmed versions from my own table?


I have data like this (simplified):

library(quanteda)

Sample data

myText <- c("ala ma kotka", "kasia ma pieska")  
myDF <- data.frame(myText)
myDF$myText <- as.character(myDF$myText)

Tokenization

tokens <- tokens(myDF$myText, what = "word",  
             remove_numbers = TRUE, remove_punct = TRUE,
             remove_symbols = TRUE, remove_hyphens = TRUE)

Stemming with my own sample dictionary

Origin <- c("kot", "pies")
Word <- c("kotek","piesek")

myDict <- data.frame(Origin, Word)

myDict$Origin <- as.character(myDict$Origin)
myDict$Word <- as.character(myDict$Word)

What I got

tokens[1]
[1] "Ala"   "ma"    "kotka"

What I would like to get

tokens[1]
[1] "Ala"   "ma"    "kot"
tokens[2]
[1] "Kasia"   "ma"    "pies"

Solution

  • A similar question has been answered here, but since that question's title (and accepted answer) do not make the obvious link, I will show how this applies to your question specifically. I'll also show below how to implement your own basic stemmer using wildcards for the suffixes.

    Manually mapping stems to inflected forms

    The simplest way to do this is by using a custom dictionary where the keys are your stems, and the values are the inflected forms. You can then use tokens_lookup() with the exclusive = FALSE, capkeys = FALSE options to convert the inflected terms into their stems.

    Note that I have modified your example a little to simplify it, and to correct what I think were mistakes.

    library("quanteda")
    packageVersion("quanteda")
    [1] ‘0.99.9’
    
    # no need for the data.frame() call
    myText <- c("ala ma kotka", "kasia ma pieska")  
    toks <- tokens(myText, 
                   remove_numbers = TRUE, remove_punct = TRUE,
                   remove_symbols = TRUE, remove_hyphens = TRUE)
    
    Origin <- c("kot", "kot", "pies", "pies")
    Word <- c("kotek", "kotka", "piesek", "pieska")
    

    Then we create the dictionary, as follows. As of quanteda v0.99.9, values with the same keys are merged, so you could have a list mapping multiple, different inflected forms to the same keys. Here, I had to add new values since the inflected forms in your original Word vector were not found in the myText example.

    temp_list <- as.list(Word) 
    names(temp_list) <- Origin
    (stem_dict <- dictionary(temp_list))
    ## Dictionary object with 2 key entries.
    ## - [kot]:
    ##   - kotek, kotka
    ## - [pies]:
    ##   - piesek, pieska    
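
    Equivalently, you can build the same dictionary in one step by passing a named list of character vectors to dictionary(). This is just a sketch of the same construction without the temporary list:

    # same dictionary, built directly from a named list;
    # each key maps to a vector of inflected forms
    stem_dict_direct <- dictionary(list(kot  = c("kotek", "kotka"),
                                        pies = c("piesek", "pieska")))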
    

    Then tokens_lookup() does its magic.

    tokens_lookup(toks, dictionary = stem_dict, exclusive = FALSE, capkeys = FALSE)
    ## tokens from 2 documents.
    ## text1 :
    ## [1] "ala" "ma"  "kot"
    ## 
    ## text2 :
    ## [1] "kasia" "ma"    "pies" 
    

    Wildcarding all stems from common roots

    An alternative is to implement your own stemmer using the "glob" wildcarding to represent all suffixes for your Origin vector, which (here, at least) produces the same results:

    temp_list <- lapply(unique(Origin), paste0, "*")
    names(temp_list) <- unique(Origin)
    (stem_dict2 <- dictionary(temp_list))
    # Dictionary object with 2 key entries.
    # - [kot]:
    #   - kot*
    # - [pies]:
    #   - pies*
    
    tokens_lookup(toks, dictionary = stem_dict2, exclusive = FALSE, capkeys = FALSE)
    ## tokens from 2 documents.
    ## text1 :
    ## [1] "ala" "ma"  "kot"
    ## 
    ## text2 :
    ## [1] "kasia" "ma"    "pies"