I have downloaded 10 tweets (to be expanded to 1000 later) and applied the usual preprocessing: removing stop words, tolower, removeNumbers, etc.
I then created a DocumentTermMatrix
and calculated the IDF (not TF-IDF) weight for each term, storing the weights in a matrix.
I need to sort this matrix from high to low while keeping each heading (the term) attached to its IDF weight value, so that I can see which term has the highest weight.
My code for computing the IDF is below. I have validated the output by calculating some of the values by hand.
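n = nrow(twitter.matrix)  # number of documents (tweets)
m = ncol(twitter.matrix)  # number of terms
twitter.weight = twitter.matrix
for (d in seq_len(n)) {
  for (t in seq_len(m)) {
    # IDF of term t: log(number of documents / documents containing t)
    twitter.weight[d, t] = log(n / sum(twitter.matrix[, t] > 0))
  }
}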
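For reference, the same weights can probably be computed without the nested loop, since the IDF of a term does not depend on the document. The following is only a minimal sketch, assuming twitter.matrix is an ordinary numeric matrix with the terms as column names (a DocumentTermMatrix would need converting first, e.g. with as.matrix()). The toy matrix and the names doc.freq and idf are made up for illustration.

```r
# Toy document-term matrix (hypothetical data): 4 documents, 3 terms
twitter.matrix <- matrix(c(1, 0, 2,
                           0, 0, 1,
                           0, 1, 1,
                           0, 0, 3),
                         nrow = 4, byrow = TRUE,
                         dimnames = list(Docs = as.character(1:4),
                                         Terms = c("etsi", "great", "amp")))

n <- nrow(twitter.matrix)

# Document frequency: number of documents in which each term occurs
doc.freq <- colSums(twitter.matrix > 0)

# One IDF value per term, stored as a named vector (names are the terms)
idf <- log(n / doc.freq)
print(idf)
```

Because idf is a named vector rather than a matrix of repeated rows, each weight stays attached to its term by name.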
Any help with sorting the matrix to keep the value attached to the term would be greatly appreciated.
twitter.weight =
Terms
Docs amp big case etsi felt great handmad httptcobnoamwqqdl httptcocyntixgf httptcoeifmunvk httptcoeyryptenz httptcogeodppqqn httptcogirhbrqw
[1,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
[2,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
[3,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
[4,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
[5,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
[6,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
[7,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
[8,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
[9,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
[10,] 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585
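SOLVED. It was simple: all I had to do was call sort() on the first row. Every row of twitter.weight is identical, because the IDF of a term does not depend on the document, and sort() keeps the column names (the terms) attached to their values. I have now run it on over 1000 tweets and am getting the results I expected. To sort, just use: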
sorted.tdm = sort(twitter.weight[1,], decreasing=TRUE)
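To illustrate why this keeps the terms attached, here is a small sketch using a named vector of made-up IDF values. The names idf, sorted.idf and top.terms are only for this example and are not part of the code above.

```r
# Hypothetical IDF weights for three terms (made-up values)
idf <- c(amp = 0.69, etsi = 2.30, great = 1.61)

# sort() reorders the values and carries the names along with them
sorted.idf <- sort(idf, decreasing = TRUE)

# The highest-weighted terms are then just the first names
top.terms <- names(sorted.idf)[1:2]

print(sorted.idf)
print(top.terms)
```

If several terms share the same weight (as in the 10-tweet output above, where every term has weight log(10)), their relative order after sorting is not meaningful.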