r sorting matrix tf-idf

Sorting a matrix containing Terms and IDF by decreasing value in R


I have downloaded 10 tweets (to be expanded to 1000 later) and applied the usual preprocessing: removing stop words, tolower, removeNumbers, etc.

I have created a DocumentTermMatrix and have calculated the IDF (not TF-IDF) weights for each term and stored them in a matrix.

I need to be able to sort this matrix from high to low, but I need to keep the heading (the term) attached to its respective IDF weight value. This is so I can figure out which term has the highest weight.

My code for computing the IDF is as follows, and I have validated that the output is correct by manually calculating some of the values.

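n <- nrow(twitter.matrix)  # number of documents
m <- ncol(twitter.matrix)  # number of terms
twitter.weight <- twitter.matrix
for (d in seq_len(n)) {
  for (t in seq_len(m)) {
    # IDF of term t: log(total documents / documents containing t)
    twitter.weight[d, t] <- log(n / sum(twitter.matrix[, t] > 0))
  }
}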
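
Since the IDF depends only on the term, the loop above writes the same value into every row of a column. A minimal vectorized sketch follows, assuming the input is a plain numeric matrix (for example the result of as.matrix() on the DocumentTermMatrix). The toy.matrix data and the names doc.freq and idf are made up for illustration and are not part of the original code.

```r
# Hypothetical toy document-term matrix: 3 documents (rows), 3 terms (columns)
toy.matrix <- matrix(
  c(1, 0, 2,
    1, 1, 0,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("amp", "big", "case"))
)

n <- nrow(toy.matrix)

# Document frequency: number of documents in which each term occurs
doc.freq <- colSums(toy.matrix > 0)

# IDF as a named vector: one value per term, names taken from the columns
idf <- log(n / doc.freq)
```

Because idf is a named vector, each weight stays attached to its term through later operations such as sorting.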

Any help with sorting the matrix to keep the value attached to the term would be greatly appreciated.

twitter.weight =

              Terms
Docs         amp      big     case     etsi     felt    great  handmad httptcobnoamwqqdl httptcocyntixgf httptcoeifmunvk httptcoeyryptenz httptcogeodppqqn httptcogirhbrqw
   [1,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585          2.302585        2.302585        2.302585         2.302585         2.302585        2.302585
   [2,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585          2.302585        2.302585        2.302585         2.302585         2.302585        2.302585
   [3,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585          2.302585        2.302585        2.302585         2.302585         2.302585        2.302585
   [4,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585          2.302585        2.302585        2.302585         2.302585         2.302585        2.302585
   [5,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585          2.302585        2.302585        2.302585         2.302585         2.302585        2.302585
   [6,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585          2.302585        2.302585        2.302585         2.302585         2.302585        2.302585
   [7,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585          2.302585        2.302585        2.302585         2.302585         2.302585        2.302585
   [8,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585          2.302585        2.302585        2.302585         2.302585         2.302585        2.302585
   [9,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585          2.302585        2.302585        2.302585         2.302585         2.302585        2.302585
  [10,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585          2.302585        2.302585        2.302585         2.302585         2.302585        2.302585

Solution

  • SOLVED. It was simple: all I had to do was use sort() on the first row. Every row holds the same IDF values, so one row is enough, and sort() keeps the term names attached to the values. I have now done it with over 1000 tweets and am getting the results I expected. To sort, just use:

    sorted.tdm = sort(twitter.weight[1,], decreasing=TRUE)
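
As a small illustration of why this works, here is a sketch with made-up weights (the names weights, toy.weight and toy.sorted are hypothetical). It assumes the row is a named numeric vector, which is what indexing a matrix with column names gives. order() is shown as an alternative if the whole matrix should be reordered rather than a single row.

```r
# Hypothetical IDF weights for one row, named by term
weights <- c(amp = 0.5, big = 2.3, case = 1.2)

# sort() on a named vector keeps each name with its value
sorted.weights <- sort(weights, decreasing = TRUE)
top.terms <- names(sorted.weights)

# To reorder all columns of a matrix instead, index with order()
toy.weight <- rbind(weights, weights)
col.order <- order(toy.weight[1, ], decreasing = TRUE)
toy.sorted <- toy.weight[, col.order, drop = FALSE]
```

names(sorted.weights)[1] then gives the term with the highest weight, and head(sorted.weights, k) gives the top k terms with their values.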