Tags: r, dictionary, text, token, quanteda

Quanteda: Fastest way to replace tokens with lemma from dictionary?


Is there a much faster alternative to R quanteda::tokens_lookup()?

I use tokens() from the quanteda R package to tokenize a character column of a data frame with 2,000 documents. Each document is 50 to 600 words long. This takes a couple of seconds on my PC (Microsoft R Open 3.4.1 with Intel MKL, using 2 cores).

I have a dictionary object, made from a data frame of nearly 600,000 words (TERM) and their corresponding lemmas (PARENT). There are 80,000 distinct lemmas.
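
For reference, the dictionary is built from that data frame roughly like this (a sketch; lemma_df, TERM and PARENT stand in for my actual names, and quanteda::dictionary() accepts a named list of term vectors):

library(quanteda)
# sketch: group the TERM column under each PARENT lemma and build the dictionary
# (lemma_df, TERM and PARENT are placeholder names for the data frame described above)
dict <- dictionary(split(as.character(lemma_df$TERM), lemma_df$PARENT))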

I use tokens_lookup() to replace the elements in the token list with the lemmas found in the dictionary. But this takes at least 1.5 hours; the function is far too slow for my problem. Is there a quicker way that still returns a tokens list?

I want to transform the token list directly, so that I can make ngrams AFTER applying the dictionary. If I only wanted unigrams, I could easily do this by joining the document-feature matrix with the dictionary.

How can I do this faster? Convert the token list to a data frame, join it with the dictionary, and convert it back to an ordered token list?
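
Roughly what I have in mind (an untested sketch, using as.list() and as.tokens() from quanteda with the tokens and dict_df objects from the example below):

toks_list <- as.list(tokens)   # one character vector per document
toks_list <- lapply(toks_list, function(w) {
    i <- match(w, as.character(dict_df$TERM))             # look up each token
    ifelse(is.na(i), w, as.character(dict_df$LEMMA)[i])   # swap in the lemma where found
})
tokens_lemma <- as.tokens(toks_list)   # rebuild the tokens object, word order preserved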

Here is the sample code:

library(quanteda)
myText <- c("the man runs home", "our men ran to work")
myDF <- data.frame(myText)
myDF$myText <- as.character(myDF$myText)

tokens <- tokens(myDF$myText, what = "word",
                 remove_numbers = TRUE, remove_punct = TRUE,
                 remove_symbols = TRUE, remove_hyphens = TRUE)
tokens
# tokens from 2 documents.
# text1 :
#   [1] "the"  "man"  "runs" "home"
# 
# text2 :
#   [1] "our"  "men"  "ran"  "to"   "work"

term <- c("man", "men", "woman", "women", "run", "runs", "ran")
lemma <- c("human", "human", "human", "human", "run", "run", "run")
dict_df <- data.frame(TERM=term, LEMMA=lemma)
dict_df
# TERM LEMMA
# 1   man human
# 2   men human
# 3 woman human
# 4 women human
# 5   run   run
# 6  runs   run
# 7   ran   run

dict_list <- list( "human" = c("man", "men", "woman", "women") , "run" = c("run", "runs", "ran"))
dict <- quanteda::dictionary(dict_list)
dict
# Dictionary object with 2 key entries.
# - human:
#   - man, men, woman, women
# - run:
#   - run, runs, ran

tokens_lemma <- tokens_lookup(tokens, dictionary=dict, exclusive = FALSE, capkeys = FALSE) 
tokens_lemma
# tokens from 2 documents.
# text1 :
#   [1] "the"   "human" "run"   "home" 
# 
# text2 :
#   [1] "our"   "human" "run"   "to"    "work"

tokens_ngrams <- tokens_ngrams(tokens_lemma, n = 1:2)
tokens_ngrams
# tokens from 2 documents.
# text1 :
#   [1] "the"       "human"     "run"       "home"      "the_human" "human_run" "run_home" 
# 
# text2 :
#   [1] "our"       "human"     "run"       "to"        "work"      "our_human" "human_run" "run_to"    "to_work" 

Solution

  • I don't have a lemma list to benchmark against, but this is the fastest way to convert token types. Please try it and let me know how long it takes (it should finish in a few seconds).

    tokens_convert <- function(x, from, to) {
        # coerce the lookup vectors to character in case the dictionary
        # columns are factors (the data.frame() default in R < 4.0)
        from <- as.character(from)
        to <- as.character(to)
        type <- attr(x, 'types')                 # unique token types in x
        type_new <- to[match(type, from)]        # map each type to its lemma
        type_new <- ifelse(is.na(type_new), type, type_new)  # keep unmatched types as-is
        attr(x, 'types') <- type_new
        quanteda:::tokens_recompile(x)           # merge duplicate types and reindex
    }
    
    tokens_convert(tokens, dict_df$TERM, dict_df$LEMMA)
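
  • For the toy example this should reproduce the tokens_lookup() result above, and you can build the ngrams from it as before (a usage sketch; tokens_lemma2 is just an illustrative name):

    tokens_lemma2 <- tokens_convert(tokens, dict_df$TERM, dict_df$LEMMA)
    tokens_ngrams(tokens_lemma2, n = 1:2)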