I am trying to use stemCompletion to convert stemmed words back into complete words.
This is the code I am using:
library(tm)
txt <- c("Once we have a corpus we typically want to modify the documents in it",
"e.g., stemming, stopword removal, et cetera.",
"In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus
# Remove common word endings (e.g., "ing", "es")
myCorpus.stemmed <- tm_map(myCorpus, stemDocument, language = "english")
myCorpus.unstemmed <- tm_map(myCorpus.stemmed, stemCompletion, dictionary=myCorpusCopy)
If I check the first element of the stemmed corpus, it is shown correctly:
[1] "onc we have a corpus we typic want to modifi the document in it"
But if I check the first element of the unstemmed corpus, it returns junk:
[1] NA
Why is the unstemmed corpus not showing the right content?
Since you have a simple corpus object, you are effectively calling
stemCompletion(
x = c("once we have a corpus we typically want to modify the documents in it",
"eg stemming stopword removal et cetera",
"in tm all this functionality is subsumed into the concept of a transformation"),
dictionary = myCorpusCopy)
which yields
# once we have a corpus we typically want to modify the documents in it
# NA
# eg stemming stopword removal et cetera
# NA
# in tm all this functionality is subsumed into the concept of a transformation
# NA
because stemCompletion expects a character vector of stems as its first argument (c("once", "we", "have")), not a character vector of stemmed texts (c("once we have")).
If you want to complete the stems in your corpus, whatever this is supposed to be good for, you have to pass a character vector of single stems to stemCompletion (i.e. tokenize each text document, stem-complete the stems, then paste them together again).
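
The tokenize / complete / paste steps could be sketched roughly as follows. This is only a minimal sketch, not the one true way to do it: completeText is a hypothetical helper name, the dictionary is simply built from the unstemmed texts, and type = "first" is an arbitrary choice among the completion heuristics stemCompletion offers. It works on plain character vectors rather than on the corpus object, and it assumes the SnowballC package is installed for stemDocument.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package installed

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")

# Same cleaning as in the question, applied to the plain character vector
cleaned <- removePunctuation(tolower(txt))
stemmed <- stemDocument(cleaned, language = "english")

# Dictionary of complete words, taken from the unstemmed texts (an assumption;
# any character vector of complete words could be used instead)
dictionary <- unique(unlist(strsplit(cleaned, "\\s+")))

# Hypothetical helper: complete the stems of ONE text and glue them back together
completeText <- function(text, dictionary) {
  # 1. tokenize the text into single stems
  stems <- unlist(strsplit(text, "\\s+"))
  stems <- stems[nzchar(stems)]
  # 2. stem-complete each stem against the dictionary
  done <- unname(stemCompletion(stems, dictionary = dictionary, type = "first"))
  # keep the original stem where no completion was found
  missing <- is.na(done) | !nzchar(done)
  done[missing] <- stems[missing]
  # 3. paste the completed words together again
  paste(done, collapse = " ")
}

completed <- vapply(stemmed, completeText, character(1),
                    dictionary = dictionary, USE.NAMES = FALSE)
completed
```

Note that completion only searches for dictionary words that start with the stem, so a stem such as "modifi" may stay as it is because "modify" does not begin with it. If you need a corpus again afterwards, Corpus(VectorSource(completed)) should give you one.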