I am using the following tm+RWeka code to extract the most frequent ngrams in texts:
library("RWeka")
library("tm")
text <- c('I am good person','I am bad person','You are great','You are more great','todo learn english','He is ok')
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))  # tokenize into bigrams
corpus <- Corpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
DF <- data.frame(inspect(tdm))  # one row per bigram, one column per document
DF$sums <- DF$X1+DF$X2+DF$X3+DF$X4+DF$X5+DF$X6
MostFreqNgrams <- rownames(head(DF[with(DF,order(-sums)),]))
It is working OK, but what if the data is much bigger? Is there a computationally more efficient way? Furthermore, if there are more variables (e.g. 100), how can I write the DF$sums
line? Surely there is something more elegant than the following:
DF$sums <- DF$X1+DF$X2+DF$X3+DF$X4+DF$X5+DF$X6+...+DF$X99+DF$X100
Thank you
EDIT: I am wondering if there is a way to extract the most frequent ngrams directly from the tdm
TermDocumentMatrix and then create a data frame with the values. What I am doing now is creating a data frame with all the ngrams and then taking the most frequent values, which does not seem to be the best choice.
Based on your edit you could use the following:
my_matrix <- as.matrix(tdm[findFreqTerms(tdm, lowfreq = 2), ])  # keep only ngrams that appear at least twice
DF <- data.frame(my_matrix, sums = rowSums(my_matrix))          # rowSums replaces the manual DF$X1 + ... + DF$X100
DF
        X1 X2 X3 X4 X5 X6 sums
i am     1  1  0  0  0  0    2
you are  0  0  1  1  0  0    2
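As for the bigger-data concern, here is a minimal sketch, assuming the slam package (which tm itself uses internally for its sparse matrices): the sums can be computed on the sparse term-document matrix directly, so the full dense matrix or data frame is never built.
library("slam")
# row_sums works on the sparse TermDocumentMatrix itself,
# avoiding as.matrix on a very large tdm
freqs <- sort(row_sums(tdm), decreasing = TRUE)
MostFreqNgrams <- names(head(freqs))  # the six most frequent ngrams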