I am trying to use stemCompletion to convert stemmed words back into complete words.
This is the code I am using:
library(tm)

txt <- c("Once we have a corpus we typically want to modify the documents in it",
"e.g., stemming, stopword removal, et cetera.",
"In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus
# Remove common word endings (e.g., "ing", "es")
myCorpus.stemmed <- tm_map(myCorpus, stemDocument, language = "english")
myCorpus.unstemmed <- tm_map(myCorpus.stemmed, stemCompletion, dictionary=myCorpusCopy)
If I check the first element of the stemmed corpus, it shows the content correctly:
myCorpus.stemmed[[1]][1]
$content
[1] "onc we have a corpus we typic want to modifi the document in it"
But if I check the first element of the unstemmed corpus, I get NA instead of text:
myCorpus.unstemmed[[1]][1]
$content
[1] NA
Why is the unstemmed corpus not showing the right content?
Since you have a simple corpus object, you are effectively calling
stemCompletion(
x = c("once we have a corpus we typically want to modify the documents in it",
"eg stemming stopword removal et cetera",
"in tm all this functionality is subsumed into the concept of a transformation"),
dictionary=myCorpusCopy
)
which yields
# once we have a corpus we typically want to modify the documents in it
# NA
# eg stemming stopword removal et cetera
# NA
# in tm all this functionality is subsumed into the concept of a transformation
# NA
because stemCompletion expects a character vector of stems as its first argument (c("once", "we", "have")), not a character vector of stemmed texts (c("once we have")).
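The difference shows up on a small example. This is only a sketch: it assumes tm is installed and uses a plain character vector of words as the dictionary. The names dictionary_words, as_stems and as_text are illustrative and do not come from the question.

```r
library(tm)

# Dictionary: the complete words of the first cleaned, unstemmed text.
dictionary_words <- strsplit(
  "once we have a corpus we typically want to modify the documents in it",
  " ", fixed = TRUE
)[[1]]

# A vector of single stems: each stem is matched against the dictionary.
as_stems <- stemCompletion(c("onc", "typic", "document"),
                           dictionary = dictionary_words)

# A whole stemmed text passed as one element: no dictionary word starts
# with the entire string, so there is nothing to complete it to.
as_text <- stemCompletion("onc we have a corpus",
                          dictionary = dictionary_words)

print(as_stems)
print(as_text)
```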
If you want to complete the stems in your corpus, whatever this is supposed to be good for, you have to pass a character vector of single stems to stemCompletion (i.e. tokenize each text document, stem-complete the stems, then paste them together again).
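A minimal sketch of those three steps, assuming tm is installed. The helper complete_stemmed_text is a hypothetical name, not part of tm, and falling back to the original stem when no completion is found is my own choice rather than something stemCompletion does for you.

```r
library(tm)

# Hypothetical helper (not part of tm): complete every stem in one text.
complete_stemmed_text <- function(stemmed_text, dictionary) {
  # 1. Tokenize on whitespace and drop empty tokens.
  stems <- strsplit(stemmed_text, "\\s+")[[1]]
  stems <- stems[nzchar(stems)]

  # 2. Complete each stem against the dictionary.
  completed <- unname(stemCompletion(stems, dictionary = dictionary))

  # Assumption: keep the stem itself when no completion was found.
  missing <- is.na(completed) | !nzchar(completed)
  completed[missing] <- stems[missing]

  # 3. Paste the words back together into one string.
  paste(completed, collapse = " ")
}

# Dictionary: all words of the cleaned, unstemmed texts.
clean_txt <- c(
  "once we have a corpus we typically want to modify the documents in it",
  "eg stemming stopword removal et cetera",
  "in tm all this functionality is subsumed into the concept of a transformation"
)
dictionary_words <- unlist(strsplit(clean_txt, " ", fixed = TRUE))

# The first stemmed document, as shown in the question.
stemmed_text <- "onc we have a corpus we typic want to modifi the document in it"

completed_text <- complete_stemmed_text(stemmed_text, dictionary_words)
print(completed_text)
```

Completion is a prefix match against the dictionary, so a stem that is not a prefix of any dictionary word (probably "modifi" here, since "modify" does not start with it) stays in its stemmed form.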