I have a sparse term-document matrix built using the tm package in R.
I can convert it to a term-term matrix using this snippet of code:
library("tm")
data(crude)
couple.of.words <- c("embargo", "energy", "oil", "environment", "estimate")
tdm <- TermDocumentMatrix(crude, control = list(dictionary = couple.of.words))
tdm.matrix <- as.matrix(tdm)
tdm.matrix[tdm.matrix >= 1] <- 1             # binarize: 1 if the term occurs in the document
tdm.matrix <- tdm.matrix %*% t(tdm.matrix)   # term-term counts of shared documents
But that's not what I really need: I have to build a data frame suitable for loading into a network analysis tool like Gephi. This data frame should ideally have three columns:
{term1, term2, number of docs where term1 and term2 co-occur}
For example (not from the real data in the example above), if the words "oil" and "energy" co-occur in three documents (this can be seen in the tdm matrix, where each document is a column), I would have a row like this:
+-------+--------+------+
| term1 | term2  | Freq |
+-------+--------+------+
| oil   | energy | 3    |
+-------+--------+------+
How can I build this nodes/edges data frame from the term-document or the term-term matrix?
It sounds like you can get what you need by adding one more line of code:
desired <- as.data.frame(as.table(tdm.matrix))
head(desired)
# Terms Terms.1 Freq
# 1 embargo embargo 8
# 2 energy embargo 6
# 3 environment embargo 2
# 4 estimate embargo 4
# 5 oil embargo 44
# 6 embargo energy 6
The as.table() call really only changes the class. And it just so happens that there is an existing as.data.frame.table() method that flattens tables into frequency listings like the one you want.
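One thing to watch for: the flattened listing contains every ordered pair, so each pair shows up twice (embargo/energy and energy/embargo) along with the self-pairs on the diagonal. If you want one row per undirected edge, something like the following might work. It is only a sketch on a small made-up matrix (tdm.toy and its values are invented for illustration, not taken from crude), and the column names term1/term2 are just the ones from the question.

```r
# Toy binary term-document matrix (rows = terms, columns = documents).
# Values are made up for illustration; they do not come from the crude corpus.
tdm.toy <- matrix(
  c(1, 1, 1, 0,
    1, 1, 1, 1,
    0, 1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("energy", "oil", "embargo"),
                  Docs  = paste0("doc", 1:4))
)

# Term-term co-occurrence counts, computed the same way as in the question.
co <- tdm.toy %*% t(tdm.toy)

# Blank out the diagonal and the lower triangle so that each
# unordered pair of terms is kept exactly once.
co[lower.tri(co, diag = TRUE)] <- NA

# Flatten to long format, then drop the blanked cells and
# any pairs that never co-occur.
edges <- as.data.frame(as.table(co), stringsAsFactors = FALSE)
names(edges) <- c("term1", "term2", "Freq")
edges <- edges[!is.na(edges$Freq) & edges$Freq > 0, ]
rownames(edges) <- NULL
edges
```

On your real data you would apply the same lower.tri() step to tdm.matrix before calling as.data.frame(as.table(...)). You can then write the result out with write.csv(edges, "edges.csv", row.names = FALSE); depending on your Gephi version you may need to rename the columns to whatever its importer expects.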