Tags: r, text-mining, tm, adjacency-matrix

R - tm package: Reduce the number of terms in the term matrix for a term-term adjacency visualization


I am having problems making a reproducible term-term adjacency visualization of my corpus, which has about 800K words.

I am following a tutorial whose term matrix contains just 20 terms, so the resulting graph is readable:

[Image: Term-Term Adjacency Graph]

I figure that my problem is that I am not able to reduce my term matrix to, let's say, the 50 most relevant terms of my corpus. I found a comment on a site outside SO that could help, but I am not able to adapt it to my needs. It says I should play with the bounds when I create the term matrix, so I ended up with this code:

dtm2 <- DocumentTermMatrix(ds4.1g, control=list(wordLengths=c(1, Inf),
        bounds=list(global=c(floor(length(ds4.1g)*0.3), floor(length(ds4.1g)*0.6)))))


tdm92.1g2 <- removeSparseTerms(dtm2, 0.99)

tdm2.1g2 <- tdm92.1g2

# Convert to a plain matrix and make it Boolean
# (counts # docs containing each term, not raw term counts)
tdm3.1g <- as.matrix(tdm2.1g2)
tdm3.1g[tdm3.1g >= 1] <- 1

# Transform into a term-term adjacency matrix
termMatrix.1gram <- tdm3.1g %*% t(tdm3.1g)

So, if I am understanding this correctly, I can make the term matrix keep only those terms that appear in at least 30% of my documents, but in no more than 60% of them.

However, no matter how I define these bounds, my term matrix termMatrix.1gram always has about 115K elements, which makes the visualization I want impossible. Is there a way to limit it to, let's say, just 50 terms?
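
For reference, here is a minimal sketch of how I understand the bounds option, run on the crude corpus that ships with tm rather than on my real data (the 6/12 cut-offs are just illustrative):

library(tm)
data("crude")   # 20 example documents shipped with tm
# keep only terms that occur in at least 6 and at most 12 of the 20 documents
dtm_bounded <- DocumentTermMatrix(crude,
                                  control = list(wordLengths = c(1, Inf),
                                                 bounds = list(global = c(6, 12))))
dtm_bounded     # printing shows the reduced number of terms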

How do I get my corpus?

Just for clarification, here is the code I use to generate my corpus with the tm package:

#specify where is the directory of the files.
folderdir <- paste0(dirname(myFile),"/", project, "/")

#load the corpus.
corpus <- Corpus(DirSource(folderdir, encoding = "UTF-8"), readerControl=list(reader=readPlain,language="de"))
#cleanse the corpus.
ds0.1g <- tm_map(corpus, content_transformer(tolower))
ds1.1g <- tm_map(ds0.1g, content_transformer(removeWords), stopwords("german"))
ds2.1g <- tm_map(ds1.1g, stripWhitespace)
ds3.1g <- tm_map(ds2.1g, removePunctuation)
ds4.1g <- tm_map(ds3.1g, stemDocument)
ds4.1g <- tm_map(ds4.1g, removeNumbers)
ds5.1g   <- tm_map(ds4.1g, content_transformer(removeWords), c("a", "b", "c", "d", "e", "f","g","h","i","j","k","l",
                                                               "m","n","o","p","q","r","s","t","u","v","w","x","y","z"))
#create matrices.
tdm.1g <- TermDocumentMatrix(ds4.1g)
dtm.1g <- DocumentTermMatrix(ds4.1g)
#reduce the sparsity.
tdm89.1g <- removeSparseTerms(tdm.1g, 0.89)
tdm9.1g  <- removeSparseTerms(tdm.1g, 0.9)
tdm91.1g <- removeSparseTerms(tdm.1g, 0.91)
tdm92.1g <- removeSparseTerms(tdm.1g, 0.92)

tdm2.1g <- tdm92.1g

As you can see, this is the traditional way to build it with the tm package. The texts are originally saved as individual txt files in a folder on my computer.


Solution

  • my problem is that I am not able to reduce my term matrix to, let's say, the 50 most relevant terms

    If "relevancy" means frequency, you could do it like this:

    library(tm)
    data("crude")
    tdm <- TermDocumentMatrix(crude)
    dtm <- DocumentTermMatrix(crude)
    head(as.matrix(tdm))
    tdm <- tdm[names(tail(sort(rowSums(as.matrix(tdm))), 50)), ]
    tdm
    # <<TermDocumentMatrix (terms: 50, documents: 20)>>
    # ...
    dtm <- dtm[, names(tail(sort(colSums(as.matrix(dtm))), 50))]
    inspect(dtm)
    # <<DocumentTermMatrix (documents: 20, terms: 50)>>
    # ...
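
    From there, a sketch of how the term-term adjacency visualization could be finished, assuming the igraph package (the layout and plot options below are illustrative, not part of the original tutorial):

    library(igraph)
    m <- as.matrix(tdm)            # 50 x 20 term-document matrix from above
    m[m >= 1] <- 1                 # Boolean: does the term occur in the document?
    termMatrix <- m %*% t(m)       # 50 x 50 term-term co-occurrence (adjacency) matrix
    g <- graph_from_adjacency_matrix(termMatrix, weighted = TRUE,
                                     mode = "undirected", diag = FALSE)
    g <- simplify(g)               # drop self-loops and duplicate edges
    plot(g, vertex.size = 5, vertex.label.cex = 0.7)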