I would like to read a text file and apply some text mining methods to it. When I used the tm package in R, I got lots of error messages. For example, when I tried to correlate the most frequent words, I got only NAs. Here is the code I have used so far:
library(tm)
doc <- c("word1 word1 word2 word1 word2 word3 word1 word2 word3 word4 word1 word2 word3 word4 word5")
Corpus <- Corpus(VectorSource(doc))
Corpus <- tm_map(Corpus, stripWhitespace)
Corpus <- tm_map(Corpus, tolower)
Corpus <- tm_map(Corpus, removeWords, stopwords("english"))
Corpus <- tm_map(Corpus, removePunctuation)
tdm <- TermDocumentMatrix(Corpus)
#Plotting correlation of Terms
plot(tdm, terms = findFreqTerms(tdm, lowfreq = 2, Inf)[1:3], CorThreshold = 0.1)
After that, I got the following error message:
Error in if (all(from == t(from))) "undirected" else "directed":
missing value where TRUE/FALSE needed
OK, to investigate, I used the following code, which is a step-by-step version of what findAssocs() does:
terms <- findFreqTerms(tdm, lowfreq = 2)[1:3]
m <- as.matrix(t(tdm[terms,]))
m
cor(m)
However, I got the following output:
      word1 word2 word3
word1    NA    NA    NA
word2    NA    NA    NA
word3    NA    NA    NA
It looks to me as if something is wrong with the text, but I have no explanation for this strange behavior. Does anybody have a solution for this problem? My R (2.15.2) is running on a Mac system (x86_64-apple-darwin9.8.0/x86_64 (64-bit)).
Thanks a lot!
cor() returns a matrix of NA values because you have only one observation of each variable: the corpus contains a single document, so each term has exactly one count. A correlation cannot be computed when the variables have only one observation each.
You can check this by looking at your matrix m:
> m
    Terms
Docs word1 word2 word3
   1     5     4     3
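To illustrate the point, here is a minimal sketch in base R (no tm needed). The counts for documents 2 and 3 are made up for the example, and the names m_one, m_many, cor_one and cor_many are only illustrative. With a single document cor() has nothing to work with; once there are several documents, and each term's count varies across them, it returns actual values.

```r
# Rows are documents, columns are terms, as in as.matrix(t(tdm[terms, ])).
terms <- c("word1", "word2", "word3")

# One document only, as in the question: a single observation per term.
m_one <- matrix(c(5, 4, 3), nrow = 1,
                dimnames = list(Docs = "1", Terms = terms))
cor_one <- cor(m_one)    # every entry is NA

# Three documents (hypothetical counts for documents 2 and 3).
m_many <- matrix(c(5, 4, 3,
                   2, 1, 0,
                   0, 3, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Docs = c("1", "2", "3"), Terms = terms))
cor_many <- cor(m_many)  # correlations are now defined

print(cor_one)
print(cor_many)
```

So to get usable correlations from your own data, the corpus would need to contain more than one document, for example by passing a character vector with several elements to VectorSource() instead of one long string.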