I'm trying to lemmatize a corpus of documents in R with the wordnet library. This is the code:
corpus.documents <- Corpus(VectorSource(vector.documents))
corpus.documents <- tm_map(corpus.documents, removePunctuation)
library(wordnet)
lapply(corpus.documents, function(x) {
  x.filter <- getTermFilter("ContainsFilter", x, TRUE)
  terms <- getIndexTerms("NOUN", 1, x.filter)
  sapply(terms, getLemma)
})
But when I run this, I get this error:
Error in .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word, :
java.lang.NoSuchMethodError: <init>
and this is the call stack:
5 stop(structure(list(message = "java.lang.NoSuchMethodError: <init>",
call = .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."),
word, ignoreCase), jobj = <S4 object of class structure("jobjRef", package
="rJava")>), .Names = c("message",
"call", "jobj"), class = c("NoSuchMethodError", "IncompatibleClassChangeError", ...
4 .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word,
ignoreCase)
3 getTermFilter("ContainsFilter", x, TRUE)
2 FUN(X[[1L]], ...)
1 lapply(corpus.documents, function(x) {
x.filter <- getTermFilter("ContainsFilter", x, TRUE)
terms <- getIndexTerms("NOUN", 1, x.filter)
sapply(terms, getLemma) ...
What's wrong?
This does not address your use of wordnet, but it does provide an option for lemmatizing that might work for you (and is better, IMO). It uses the MorphAdorner API developed at Northwestern University. You can find detailed documentation here. In the code below I'm using their Adorner for Plain Text API.
# MorphAdorner (Northwestern University) web service
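adorn <- function(text) {
  require(httr)  # sends the HTTP request
  require(XML)   # parses the XML response
  url <- "http://devadorner.northwestern.edu/maserver/partofspeechtagger"
  response <- GET(url, query = list(text = text, media = "xml",
                                    xmlOutputType = "outputPlainXML",
                                    corpusConfig = "ncf",  # Nineteenth Century Fiction
                                    includeInputText = "false", outputReg = "true"))
  # Parse the body with the XML package explicitly: recent httr versions
  # return an xml2 document from content(type = "text/xml"), which the
  # XML functions below cannot use.
  doc <- xmlParse(content(response, as = "text", encoding = "UTF-8"), asText = TRUE)
  words <- getNodeSet(doc, "//adornedWord")  # one node per adorned word
  xmlToDataFrame(doc, nodes = words)
}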
library(tm)
vector.documents <- c("Here is some text.",
                      "This might possibly be some additional text, but then again, maybe not...",
                      "This is an abstruse grammatical construction having as its sole intention the demonstration of MorphAdorner's capability.")
corpus.documents <- Corpus(VectorSource(vector.documents))
lapply(corpus.documents,function(x) adorn(as.character(x)))
# [[1]]
# token spelling standardSpelling lemmata partsOfSpeech
# 1 Here Here Here here av
# 2 is is is be vbz
# 3 some some some some d
# 4 text text text text n1
# 5 . . . . .
# ...
I'm only showing the lemmatization of the first "document". The partsOfSpeech column follows the NUPOS convention.
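If you want a lemmatized string back rather than a table, you can collapse the lemmata column. This is only a rough sketch: lemmatize_adorned and its drop.punctuation argument are names I made up for illustration, and it assumes the data frame has the columns shown in the output above. The example data frame is typed in by hand from that output so the snippet runs without calling the web service.

```r
# Hypothetical helper (not part of MorphAdorner or any package):
# rebuild a lemmatized string from a data frame shaped like adorn()'s result.
# Assumes a "lemmata" column, as in the output shown above.
lemmatize_adorned <- function(adorned, drop.punctuation = TRUE) {
  lemmata <- as.character(adorned$lemmata)
  if (drop.punctuation) {
    # keep only lemmata that contain at least one letter or digit
    lemmata <- lemmata[grepl("[[:alnum:]]", lemmata)]
  }
  paste(lemmata, collapse = " ")
}

# Hand-built copy of the first document's result, so this runs offline
adorned.example <- data.frame(
  token         = c("Here", "is", "some", "text", "."),
  lemmata       = c("here", "be", "some", "text", "."),
  partsOfSpeech = c("av", "vbz", "d", "n1", "."),
  stringsAsFactors = FALSE
)

lemmatize_adorned(adorned.example)
# [1] "here be some text"
```

With the real service you would presumably apply it per document, e.g. sapply(corpus.documents, function(x) lemmatize_adorned(adorn(as.character(x)))), though I have not tested that against the live API.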