My particular corpus contains approximately 20k documents and ~9k terms once processed and stemmed.
This is due to the nature of the data collection: the responses come from user-submitted online surveys, where people tend to leave very short answers of one sentence, or even one or two words.
If I run kmeans()
on the tdm and then on the dtm, the results differ, e.g. in the within-cluster sum of squares. I know that a tdm is just a transposed dtm and vice versa.
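The difference is easy to see with a toy matrix in base R (no tm needed; the counts here are made up): kmeans clusters the *rows* of whatever matrix it receives, so the dtm yields one label per document while its transpose yields one label per term.

```r
# Toy document-term matrix: rows = documents, columns = terms (made-up counts)
dtm <- matrix(c(2, 1, 0,
                1, 2, 0,
                0, 0, 3,
                0, 1, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4), c("cat", "dog", "apple")))
tdm <- t(dtm)  # the term-document matrix is just the transpose

set.seed(123)
kfit_docs  <- kmeans(dtm, 2)  # clusters the 4 documents
kfit_terms <- kmeans(tdm, 2)  # clusters the 3 terms
length(kfit_docs$cluster)     # one label per document
length(kfit_terms$cluster)    # one label per term
```

So the two fits are not two views of the same clustering: they partition different objects, which is why their within-cluster sums of squares don't match.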
After discussing both the tdm and the dtm, this post on R-bloggers said:
Which of these proves to be most convenient will depend on the relative number of documents and terms in your data.
With so many terms and documents I found plotting a clusplot very difficult. So I removed some sparsity (0.96), which left me with 33 terms but still a very large number of documents. Presumably most text-mining scenarios are the reverse, with a higher number of terms relative to documents.
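For reference, the sparsity removal step maps to tm::removeSparseTerms(tdm, 0.96). Roughly, that keeps a term only if it is absent from at most 96% of documents. A base-R sketch of the idea (prune_sparse_terms is a hypothetical helper for illustration, not a tm function):

```r
# Roughly what tm::removeSparseTerms(tdm, 0.96) does:
# a term survives only if it is absent from at most `sparse` of the documents.
prune_sparse_terms <- function(tdm, sparse = 0.96) {
  frac_absent <- rowMeans(tdm == 0)           # rows = terms in a tdm
  tdm[frac_absent <= sparse, , drop = FALSE]  # keep only the denser terms
}

toy_tdm <- matrix(c(1, 0, 0, 0,    # rare term: absent from 75% of docs
                    2, 1, 1, 3),   # common term: absent from 0% of docs
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("rare", "common"), paste0("doc", 1:4)))
prune_sparse_terms(toy_tdm, sparse = 0.5)  # only "common" survives
```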
Based on my description, should I run kmeans on the tdm or the dtm? I'm seeking to group terms together, aiming to find generalizations about why people are submitting these forms.
Here is a sample code block I have been playing with; what exactly is the difference between kfit and kfit1?
library(tm) # for text mining
## make a example corpus
# make a df of documents a to i
# try making some docs mostly about pets
a <- "dog bunny dog cat hamster"
b <- "cat cat bunny dog hamster"
c <- "cat fish dog"
d <- "cat dog bunny hamster fish"
# try making the remaining docs about fruits
e <- "apple mango orange carrot"
f <- "cabbage apple dog"
g <- "orange mango cat apple"
h <- "apple apple orange"
i <- "apple orange carrot"
j <- c(a,b,c,d,e,f,g,h,i)
# tm's DataframeSource expects columns named doc_id and text
x <- data.frame(doc_id = letters[1:9], text = j, stringsAsFactors = FALSE)
# turn the corpus into a term-document matrix (tdm) and a document-term matrix (dtm)
docs <- Corpus(DataframeSource(x))
tdm <- TermDocumentMatrix(docs)
dtm <- DocumentTermMatrix(docs)
# kmeans clustering
set.seed(123)
kfit <- kmeans(tdm, 2)
kfit1 <- kmeans(dtm, 2)
# plot: needs the cluster package
library(cluster)
clusplot(as.matrix(tdm), kfit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# t(table(kfit1$cluster, 1:dtm$nrow)) # for document-based analysis (kfit1 is fit on the dtm)
table(tdm$dimnames$Terms, kfit$cluster) # for term-based analysis (kfit is fit on the tdm)
Usually, k-means implementations expect instances in rows.
If you want to cluster documents, then the documents should be the instances, i.e. the rows. Running on the transposed matrix will instead cluster terms, by the documents they appear in.
It is like computing row averages vs. column averages: the computation is the same kind of operation, but the semantics are very different. Doing the wrong one because it is "more convenient" (?!?) sounds like a very bad idea.
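Applied to your question: since you want to group terms, the terms must be the rows, i.e. run kmeans on the tdm (that is kfit in your sample code). A minimal base-R sketch with toy counts (not your survey data) of what clustering the tdm's rows looks like:

```r
# Terms as rows: each term is described by its counts across documents
tdm <- matrix(c(2, 1, 0, 0,   # "cat"   occurs in the pet docs
                1, 2, 0, 0,   # "dog"   occurs in the pet docs
                0, 0, 3, 1,   # "apple" occurs in the fruit docs
                0, 0, 1, 2),  # "mango" occurs in the fruit docs
              nrow = 4, byrow = TRUE,
              dimnames = list(c("cat", "dog", "apple", "mango"),
                              paste0("doc", 1:4)))
set.seed(123)
term_fit <- kmeans(tdm, 2, nstart = 10)  # nstart guards against poor starts
# with this toy data, the fit almost always recovers the pet/fruit split:
split(rownames(tdm), term_fit$cluster)
```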