
A lemmatizing function using a hash dictionary does not work with tm package in R


I would like to lemmatize Polish text using a large external dictionary (in the format of the txt variable below). Unfortunately, popular text mining packages do not offer Polish as an option. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.)

Unfortunately it does not work with the corpus format generated by tm. Let me paste Dmitriy's code:

library(hashmap)
library(data.table)

# dictionary: each line pairs a lemma with one of its inflected forms
txt = 
  "Abadan  Abadanem
  Abadan  Abadanie
  Abadan  Abadanowi
  Abadan  Abadanu
  abadańczyk  abadańczycy
  abadańczyk  abadańczykach
  abadańczyk  abadańczykami
  "
dt = fread(txt, header = FALSE, col.names = c("lemma", "word"))
# hash lookup: inflected form -> lemma
lemma_hm = hashmap(dt$word, dt$lemma)

lemma_hm[["Abadanu"]]
#"Abadan"


lemma_tokenizer = function(x, lemma_hashmap, 
                           tokenizer = text2vec::word_tokenizer) {
  tokens_list = tokenizer(x)
  for(i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    # vectorized lookup; tokens missing from the dictionary come back as NA
    replacements = lemma_hashmap[[tokens]]
    ind = !is.na(replacements)
    # replace only the tokens that had a dictionary hit
    tokens_list[[i]][ind] = replacements[ind]
  }
  tokens_list
}
texts = c("Abadanowi abadańczykach OutOfVocabulary", 
          "abadańczyk Abadan OutOfVocabulary")
lemma_tokenizer(texts, lemma_hm)

#[[1]]
#[1] "Abadan"          "abadańczyk"      "OutOfVocabulary"
#[[2]]
#[1] "abadańczyk"      "Abadan"          "OutOfVocabulary"

Now I would like to apply it to the tm-generated corpus "docs". Here is an example of the syntax I would use with the tm package:

docs <- tm_map(docs, function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm"))

Another syntax that I tried:

LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")

docsTDM <-
  DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25), tokenize=LemmaTokenizer))

It throws an error:

 Error in lemma_hashmap[[tokens]] : 
  attempt to select more than one element in vectorIndex 

The function works with a vector of texts, but it will not work with a tm corpus. Thanks in advance for suggestions (even for using this function with another text mining package, if it will not work with tm).


Solution

  • I see two problems here: 1) your custom function returns a list, while it should return a vector of strings; and 2) you are passing the wrong lemma_hashmap argument (a string instead of the variable).

    A quick workaround for the first problem is to use paste() and sapply() to collapse each token list into a single string before returning the result.

    lemma_tokenizer = function(x, lemma_hashmap, 
                               tokenizer = text2vec::word_tokenizer) {
      tokens_list = tokenizer(x)
      for(i in seq_along(tokens_list)) {
        tokens = tokens_list[[i]]
        replacements = lemma_hashmap[[tokens]]
        ind = !is.na(replacements)
        tokens_list[[i]][ind] = replacements[ind]
      }
    
      # paste together, return a vector
      sapply(tokens_list, paste, collapse = " ")
    }
    

    We can run the same example from your post:

    texts = c("Abadanowi abadańczykach OutOfVocabulary", 
              "abadańczyk Abadan OutOfVocabulary")
    lemma_tokenizer(texts, lemma_hm)
    [1] "Abadan abadańczyk OutOfVocabulary" "abadańczyk Abadan OutOfVocabulary"
    

    Now we can use tm_map. Just make sure to pass lemma_hm (i.e., the variable) and not "lemma_hm" (a string) as the argument.

    library(tm)
    docs <- SimpleCorpus(VectorSource(texts))
    out <- tm_map(docs, (function(x) {lemma_tokenizer(x, lemma_hashmap=lemma_hm)}))
    out[[1]]$content
    [1] "Abadan abadańczyk OutOfVocabulary"