Tags: r, text-mining, tm, term-document-matrix, rweka

R Text Mining - Converting Term Document Matrix


I created a list of bigrams using:

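library(tm)     # TermDocumentMatrix
library(RWeka)  # NGramTokenizer, Weka_control

# Tokenizer that emits only two-word sequences (bigrams)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm_a.bigram <- TermDocumentMatrix(docs_a,
                                   control = list(tokenize = BigramTokenizer))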

I am trying to count the number of documents each bigram appears in. If I understand correctly, a Term Document Matrix gives the number of times each bigram occurs within each document. I only need a 1 when the bigram is present in a document and a 0 when it is not.

How do I convert the Term Document Matrix into a data frame or matrix so that I can get such a count?


Solution

  • A TDM is a simple_triplet_matrix from the slam package, which has functions for common operations such as row and column sums.

    slam::row_sums(tdm_a.bigram>=1)

    This should tell you how many documents contain each bigram.
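
To show the idea on something that runs without a corpus or RWeka, here is a minimal sketch on a toy matrix. The bigrams, document names and counts are made up for illustration, and the variable names (counts, stm, doc_freq_dense and so on) are hypothetical. Since a TDM is a simple_triplet_matrix, the same steps should apply to tdm_a.bigram in place of stm, though that is not tested here.

```r
library(slam)

# Hypothetical toy counts: rows are bigrams, columns are documents.
counts <- matrix(c(2, 0, 1,
                   0, 0, 3,
                   1, 1, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Terms = c("text mining",
                                           "term document",
                                           "document matrix"),
                                 Docs  = c("d1", "d2", "d3")))

# Sparse triplet representation, the same class a TDM is built on.
stm <- as.simple_triplet_matrix(counts)

# Route 1: convert to an ordinary dense matrix, then count the
# documents (columns) in which each bigram (row) occurs at least once.
dense <- as.matrix(stm)
doc_freq_dense <- rowSums(dense > 0)

# Route 2: stay sparse. Replace every stored count with a 0/1
# presence flag, then sum across documents with slam::row_sums.
presence <- stm
presence$v <- as.numeric(presence$v > 0)
doc_freq_sparse <- row_sums(presence)

# Data frame form: one row per bigram with its document count.
doc_freq_df <- data.frame(bigram = names(doc_freq_sparse),
                          n_docs = as.vector(doc_freq_sparse),
                          stringsAsFactors = FALSE)
```

The dense route is the most direct answer to "convert to a matrix", but it allocates every cell, so for a large bigram vocabulary the sparse route is likely the safer choice.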