Tags: r, frequency, corpus, word-frequency

How to calculate the most frequently occurring terms/words in a document collection/corpus using R?


First, I create a document-term matrix:

dtm <- DocumentTermMatrix(docs)

Then I sum the occurrences of each word across all documents:

totalsums <- colSums(as.matrix(dtm))

My totalsums (R reports its type as 'double') looks like this for the first 7 elements:

aaab aabb aabc aacc abbb abbc abcc ...
   9    2   10    4    7    3   12 ...

I managed to sort this with the following command:

sorted.sums <- sort(totalsums, decreasing = TRUE)

Now I want to extract the first 4 terms/words with the highest sums, keeping only those whose sum is greater than 5. I can get the 4 highest with sorted.sums[1:4], but how can I set a threshold value?

I managed to get this far with the order function and findFreqTerms, as shown below, but is there a way to do this other than with the sort function, or without using the findFreqTerms function?

ord.totalsums <- order(totalsums)
findFreqTerms(dtm, lowfreq=5)

I'd appreciate your thoughts on this.


Solution

  • You can use

    sorted.sums[sorted.sums > 5][1:4]
    

    If at least 4 values are greater than 5, sorted.sums[1:4] on its own gives the same result.

    To get the words themselves, use names():

    names(sorted.sums[sorted.sums > 5][1:4])
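As a minimal sketch of the approach above, the following uses only base R and the seven sample counts from the question in place of a real document-term matrix. The variable names threshold and top.n are hypothetical, introduced here for illustration. It also shows one edge case worth knowing about: when fewer than 4 values pass the threshold, indexing with [1:4] pads the result with NA, while head() simply returns what is available.

```r
# Sample counts copied from the first seven elements shown in the question.
totalsums <- c(aaab = 9, aabb = 2, aabc = 10, aacc = 4,
               abbb = 7, abbc = 3, abcc = 12)

# Sort once, highest count first.
sorted.sums <- sort(totalsums, decreasing = TRUE)

threshold <- 5   # hypothetical name: minimum count (exclusive)
top.n <- 4       # hypothetical name: how many terms to keep

# Filter by the threshold, then keep at most top.n entries.
top.terms <- head(sorted.sums[sorted.sums > threshold], top.n)

print(top.terms)          # counts for the selected terms
print(names(top.terms))   # the terms themselves

# With a stricter threshold only two values qualify.
padded  <- sorted.sums[sorted.sums > 9][1:4]      # length 4, padded with NA
trimmed <- head(sorted.sums[sorted.sums > 9], 4)  # length 2, no NA
```

Using head() instead of [1:4] is a design choice, not a requirement: both give the same result whenever at least 4 values exceed the threshold.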