I am having trouble creating a reproducible term-term adjacency visualization of my corpus, which contains about 800K words.
I am following a tutorial whose term matrix contains just 20 terms, so its resulting visualization is easy to read.
I think my problem is that I am not able to reduce my term matrix to, let's say, the 50 most relevant terms of my corpus. I found a comment on a site outside SO that could help, but I have not been able to adapt it to my needs. It says I should play with the bounds when creating the term matrix, so I ended up with this code:
dtm2 <- DocumentTermMatrix(ds4.1g,
        control = list(wordLengths = c(1, Inf),
                       bounds = list(global = c(floor(length(ds4.1g) * 0.3),
                                                floor(length(ds4.1g) * 0.6)))))
tdm92.1g2 <- removeSparseTerms(dtm2, 0.99)
tdm2.1g2 <- tdm92.1g2
# Create a Boolean matrix (1 if a term occurs in a document, regardless of count)
tdm3.1g <- as.matrix(tdm2.1g2)
tdm3.1g[tdm3.1g >= 1] <- 1
# Transform into a term-term adjacency matrix
# (tdm3.1g comes from a DocumentTermMatrix, i.e. documents x terms,
#  so the term-term matrix is t(m) %*% m)
termMatrix.1gram <- t(tdm3.1g) %*% tdm3.1g
So, if I understand this correctly, with these bounds the term matrix should keep only those terms that appear in at least 30% of my documents, but in no more than 60% of them.
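As a sanity check (using the dtm2 created above), something like this should show the document frequency of each surviving term, which ought to fall inside that window:
# document frequency of each term that survived the bounds
docFreq <- colSums(as.matrix(dtm2) > 0)
summary(docFreq)  # min/max should lie between floor(nDocs*0.3) and floor(nDocs*0.6)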
However, no matter how I set these bounds, my term matrix termMatrix.1gram always has 115K elements, which makes the visualization I want impossible. Is there a way to limit it to, let's say, just 50 terms?
How do I get my corpus?
Just for clarification, here is the code I use to generate my corpus with the tm package:
#specify the directory where the files are.
folderdir <- paste0(dirname(myFile),"/", project, "/")
#load the corpus.
corpus <- Corpus(DirSource(folderdir, encoding = "UTF-8"), readerControl=list(reader=readPlain,language="de"))
#cleanse the corpus.
ds0.1g <- tm_map(corpus, content_transformer(tolower))
ds1.1g <- tm_map(ds0.1g, content_transformer(removeWords), stopwords("german"))
ds2.1g <- tm_map(ds1.1g, stripWhitespace)
ds3.1g <- tm_map(ds2.1g, removePunctuation)
ds4.1g <- tm_map(ds3.1g, stemDocument)
ds4.1g <- tm_map(ds4.1g, removeNumbers)
#remove stray single letters left over after punctuation removal and stemming
ds5.1g <- tm_map(ds4.1g, content_transformer(removeWords), letters)
#create the matrices.
tdm.1g <- TermDocumentMatrix(ds4.1g)
dtm.1g <- DocumentTermMatrix(ds4.1g)
#reduce the sparsity.
tdm89.1g <- removeSparseTerms(tdm.1g, 0.89)
tdm9.1g <- removeSparseTerms(tdm.1g, 0.9)
tdm91.1g <- removeSparseTerms(tdm.1g, 0.91)
tdm92.1g <- removeSparseTerms(tdm.1g, 0.92)
tdm2.1g <- tdm92.1g
As you can see, this is the standard way of building it with the tm package. The texts are originally saved as individual txt files in a folder on my computer.
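To decide which of those sparsity thresholds to keep, something like this (using tm's nTerms()) compares how many terms survive each one:
#compare the number of surviving terms per sparsity threshold
sapply(list(tdm89.1g, tdm9.1g, tdm91.1g, tdm92.1g), nTerms)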
"my problem is that I am not able to reduce my term matrix to, let's say, the 50 most relevant terms"
If "relevancy" means frequency, you could do it like this:
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
dtm <- DocumentTermMatrix(crude)
head(as.matrix(tdm))
# keep only the 50 terms with the highest overall frequency (largest row sums)
tdm <- tdm[names(tail(sort(rowSums(as.matrix(tdm))), 50)), ]
tdm
# <<TermDocumentMatrix (terms: 50, documents: 20)>>
# ...
# the same for the DocumentTermMatrix, using column sums
dtm <- dtm[, names(tail(sort(colSums(as.matrix(dtm))), 50))]
inspect(dtm)
# <<DocumentTermMatrix (documents: 20, terms: 50)>>
# ...
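From there you can rebuild the term-term adjacency matrix on just those 50 terms and visualize it. A minimal sketch, assuming the igraph package is available for the plotting step:
library(igraph)

# binary term-document matrix for the 50 most frequent terms
m <- as.matrix(tdm)
m[m >= 1] <- 1

# term-term adjacency: number of documents in which two terms co-occur
termMatrix <- m %*% t(m)

# plot the co-occurrence graph
g <- graph.adjacency(termMatrix, weighted = TRUE, mode = "undirected", diag = FALSE)
plot(g, vertex.size = 5, vertex.label.cex = 0.7)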