
R error when lemmatizing a corpus of documents with wordnet


I'm trying to lemmatize a corpus of documents in R with the wordnet library. This is the code:

library(tm)
corpus.documents <- Corpus(VectorSource(vector.documents))
corpus.documents <- tm_map(corpus.documents, removePunctuation)

library(wordnet)
lapply(corpus.documents,function(x){
  x.filter <- getTermFilter("ContainsFilter", x, TRUE)
  terms <- getIndexTerms("NOUN", 1, x.filter)
  sapply(terms, getLemma)
})

But when I run this, I get the following error:

Error in .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word,  :
java.lang.NoSuchMethodError: <init> 

and this is the call stack:

5 stop(structure(list(message = "java.lang.NoSuchMethodError: <init>", 
call = .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), 
    word, ignoreCase), jobj = <S4 object of class structure("jobjRef", package 
="rJava")>), .Names = c("message", 
"call", "jobj"), class = c("NoSuchMethodError", "IncompatibleClassChangeError",  ... 
4 .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word, 
ignoreCase) 
3 getTermFilter("ContainsFilter", x, TRUE) 
2 FUN(X[[1L]], ...) 
1 lapply(corpus.documents, function(x) {
x.filter <- getTermFilter("ContainsFilter", x, TRUE)
terms <- getIndexTerms("NOUN", 1, x.filter)
sapply(terms, getLemma) ... 

What's wrong?


Solution

  • This does not address your use of wordnet, but it does offer another way to lemmatize that might work for you (and is better, IMO). It uses the MorphAdorner API developed at Northwestern University. Detailed documentation is available from the MorphAdorner project. The code below uses their Adorner for Plain Text API.

    # MorphAdorner (Northwestern University) web service
    adorn <- function(text) {
      require(httr)
      require(XML)
      url <- "http://devadorner.northwestern.edu/maserver/partofspeechtagger"
      response <- GET(url,query=list(text=text, media="xml", 
                                     xmlOutputType="outputPlainXML",
                                     corpusConfig="ncf", # Nineteenth Century Fiction
                                     includeInputText="false", outputReg="true"))
      doc <- content(response,type="text/xml")
      words <- doc["//adornedWord"]
      xmlToDataFrame(doc,nodes=words)
    }
    
    library(tm)
    vector.documents <- c("Here is some text.", 
                          "This might possibly be some additional text, but then again, maybe not...",
                          "This is an abstruse grammatical construction having as its sole intention the demonstration of MorphAdorner's capability.")
    corpus.documents <- Corpus(VectorSource(vector.documents))
    lapply(corpus.documents,function(x) adorn(as.character(x)))
    # [[1]]
    #   token spelling standardSpelling lemmata partsOfSpeech
    # 1  Here     Here             Here    here            av
    # 2    is       is               is      be           vbz
    # 3  some     some             some    some             d
    # 4  text     text             text    text            n1
    # 5     .        .                .       .             .
    # ...
    

    Only the lemmatization of the first "document" is shown. The partsOfSpeech column follows the NUPOS convention.
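
If what you ultimately want is lemmatized text rather than a table, the lemmata column can be collapsed back into a string. The following is only a minimal sketch: lemmatize_from_adorned is a hypothetical helper (not part of MorphAdorner, httr, or XML), and it assumes the data frame has a lemmata column like the one in the output above. The sample data frame is built by hand from that output, so no web request is made.

```r
# Hypothetical helper: collapse an adorn()-style data frame into one
# lemmatized string. Assumes a column named `lemmata`, as in the output above.
lemmatize_from_adorned <- function(adorned, drop_punct = TRUE) {
  lemmas <- as.character(adorned$lemmata)
  if (drop_punct) {
    # Keep only lemmata that contain at least one letter or digit
    lemmas <- lemmas[grepl("[[:alnum:]]", lemmas)]
  }
  paste(lemmas, collapse = " ")
}

# Hand-built stand-in for the first result shown above (assumption: the
# service returns these columns as plain character data)
adorned <- data.frame(
  token         = c("Here", "is", "some", "text", "."),
  lemmata       = c("here", "be", "some", "text", "."),
  partsOfSpeech = c("av", "vbz", "d", "n1", "."),
  stringsAsFactors = FALSE
)

lemmatize_from_adorned(adorned)
# [1] "here be some text"
```

With a live service, the same helper could presumably be applied to each element of the list returned by the lapply() call, e.g. sapply(result, lemmatize_from_adorned), where result is a name assumed here for that list.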