r, text-mining, tm, quanteda, text2vec

Lemmatization using a txt file with lemmas in R


I would like to use an external txt file with Polish lemmas, structured as follows (lemma lists for many other languages are available at http://www.lexiconista.com/datasets/lemmatization/):

Abadan  Abadanem
Abadan  Abadanie
Abadan  Abadanowi
Abadan  Abadanu
abadańczyk  abadańczycy
abadańczyk  abadańczyka
abadańczyk  abadańczykach
abadańczyk  abadańczykami
abadańczyk  abadańczyki
abadańczyk  abadańczykiem
abadańczyk  abadańczykom
abadańczyk  abadańczyków
abadańczyk  abadańczykowi
abadańczyk  abadańczyku
abadanka    abadance
abadanka    abadanek
abadanka    abadanką
abadanka    abadankach
abadanka    abadankami

Which packages, and with what syntax, would allow me to use such a txt database to lemmatize my bag of words? I realize that for English there is WordNet, but those who need this functionality for less common languages are out of luck.

If there is no such package, can this database be converted so that it works with any package that provides lemmatization? Perhaps by converting it to a wide form? For instance, the form used by the free AntConc concordancer (http://www.laurenceanthony.net/software/antconc/):

Abadan -> Abadanem, Abadanie, Abadanowi, Abadanu
abadańczyk -> abadańczycy, abadańczyka, abadańczykach 
etc.
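
The long-to-wide conversion described above can be sketched in base R. This is a minimal illustration on a toy excerpt, not a tested import routine for AntConc; the object names (`lemmas`, `forms_by_lemma`, `wide_lines`) are made up for the example.

```r
# Toy excerpt of the lemma list in long form: one (lemma, inflected form) pair per row.
lemmas <- data.frame(
  lemma = c("Abadan", "Abadan", "Abadan", "Abadan",
            "abadańczyk", "abadańczyk", "abadańczyk"),
  word  = c("Abadanem", "Abadanie", "Abadanowi", "Abadanu",
            "abadańczycy", "abadańczyka", "abadańczykach"),
  stringsAsFactors = FALSE
)

# Group the inflected forms by lemma, keeping lemmas in order of first appearance.
lemma_levels <- unique(lemmas$lemma)
forms_by_lemma <- split(lemmas$word, factor(lemmas$lemma, levels = lemma_levels))

# Collapse each group into a single "lemma -> form1, form2, ..." line.
wide_lines <- paste0(
  names(forms_by_lemma), " -> ",
  vapply(forms_by_lemma, paste, character(1), collapse = ", ")
)

writeLines(wide_lines)
```

`writeLines()` also accepts a file path as its second argument, so the same vector could be written straight to a txt file.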

In brief: how can lemmatization with lemmas from a txt file be done in any of the known CRAN text-mining packages for R? And how should such a txt file be formatted?

UPDATE: Dear @DmitriySelivanov, I got rid of all diacritical marks, and now I would like to apply it to the tm corpus "docs":

docs <- tm_map(docs, function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")) 

I also tried it as a tokenizer:

LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")

docsTDM <-
  DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25), tokenize=LemmaTokenizer)) 

It throws an error:

 Error in lemma_hashmap[[tokens]] : 
  attempt to select more than one element in vectorIndex 

The function works like a charm with a vector of texts, though.
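
One plausible reading of this message, offered as a guess rather than a confirmed diagnosis: in both calls above, `lemma_hashmap="lemma_hm"` passes the quoted name, so the function receives a plain character string instead of the hashmap object. Base R raises an error when `[[` is used with several names on an atomic vector, which the sketch below reproduces without any packages (the token values are illustrative).

```r
lemma_hashmap <- "lemma_hm"          # a string, as passed in the calls above
tokens <- c("Abadanowi", "Abadanu")  # more than one token

# [[ on a length-one character vector with several names fails in base R;
# the error message is captured instead of stopping the script.
result <- tryCatch(
  lemma_hashmap[[tokens]],
  error = function(e) conditionMessage(e)
)
print(result)
```

If that is the cause, passing the object itself (`lemma_hashmap = lemma_hm`, without quotes) should avoid this particular error; whether tm then needs further adjustments is not covered here.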


Solution

  • My guess is that this task has nothing to do with text-mining packages. You just need to replace each word in the second column with the corresponding word in the first column. You can do this by creating a hashmap (for example, https://github.com/nathan-russell/hashmap).

    Below is an example of how to create a "lemmatizing" tokenizer that you can easily use in text2vec (and, I guess, in quanteda as well).

    Contributions toward such a "lemmatizing" package are very welcome; it would be very useful.

    library(hashmap)
    library(data.table)
    txt = 
      "Abadan  Abadanem
      Abadan  Abadanie
      Abadan  Abadanowi
      Abadan  Abadanu
      abadańczyk  abadańczycy
      abadańczyk  abadańczykach
      abadańczyk  abadańczykami
      "
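    # Parse the two-column text: lemma first, inflected form second
    dt = fread(txt, header = FALSE, col.names = c("lemma", "word"))
    # Map each inflected form (key) to its lemma (value)
    lemma_hm = hashmap(dt$word, dt$lemma)
    
    lemma_hm[["Abadanu"]]
    #"Abadan"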
    
    
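    # Tokenize each text, then replace every token found in the hashmap by its lemma
    lemma_tokenizer = function(x, lemma_hashmap, 
                               tokenizer = text2vec::word_tokenizer) {
      tokens_list = tokenizer(x)
      for(i in seq_along(tokens_list)) {
        tokens = tokens_list[[i]]
        # Vectorized lookup; tokens missing from the hashmap come back as NA
        replacements = lemma_hashmap[[tokens]]
        ind = !is.na(replacements)
        # Overwrite only the tokens that have a known lemma
        tokens_list[[i]][ind] = replacements[ind]
      }
      tokens_list
    }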
    texts = c("Abadanowi abadańczykach OutOfVocabulary", 
              "abadańczyk Abadan OutOfVocabulary")
    lemma_tokenizer(texts, lemma_hm)
    
    #[[1]]
    #[1] "Abadan"          "abadańczyk"      "OutOfVocabulary"
    #[[2]]
    #[1] "abadańczyk"      "Abadan"          "OutOfVocabulary"
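
The same word-to-lemma replacement can also be sketched without the hashmap package, using a named character vector as the lookup table. This swaps in a base R technique and is not part of the answer's original code; `lemma_lookup` and `lemmatize_tokens` are hypothetical names chosen for the illustration.

```r
# Toy lemma table in long form; a full list would be read from the txt file instead.
lemma <- c("Abadan", "Abadan", "abadańczyk", "abadańczyk")
word  <- c("Abadanowi", "Abadanu", "abadańczykach", "abadańczykami")

# Named character vector: names are inflected forms, values are lemmas.
lemma_lookup <- setNames(lemma, word)

# Replace each token by its lemma; tokens not in the lookup are kept unchanged.
lemmatize_tokens <- function(tokens, lookup) {
  replacements <- unname(lookup[tokens])  # NA where a token has no entry
  ifelse(is.na(replacements), tokens, replacements)
}

tokens <- c("Abadanowi", "abadańczykach", "OutOfVocabulary")
print(lemmatize_tokens(tokens, lemma_lookup))
```

If an inflected form appears more than once in the table, indexing by name returns the first match, so ambiguous forms resolve to whichever lemma is listed first.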