Document-Term Matrix with Quanteda


I have a data frame df with this structure:

Rank Review
5    good film
8    very goood film
..

Then I tried to create a DocumentTermMatrix using the quanteda package:

temp.tf <- df$Review %>% tokens(ngrams = 1:1) %>% # generate tokens
  dfm %>% # generate dfm
  convert(to = "tm")

I get this matrix:

> inspect(temp.tf)
<<DocumentTermMatrix (documents: 63023, terms: 23892)>>
Non-/sparse entries: 520634/1505224882
Sparsity           : 100%
Maximal term length: 77
Weighting          : term frequency (tf)
Sample             :

With this structure:

           Terms
Docs        good very film my excellent heart David plus always so
  text14670    1    0    0  0         1     0     0    0      2  0
  text19951    3    0    0  0         0     0     0    1      1  1
  text24305    7    0    2  1         0     0     0    2      0  0
  text26985    6    0    0  0         0     0     0    4      0  1
  text29518    4    0    1  0         1     0     0    3      0  1
  text34547    5    2    0  0         0     0     2    3      1  3
  text3781     3    0    1  4         0     0     0    3      0  0
  text5272     4    0    0  4         0     5     0    3      1  2
  text5367     3    0    1  3         0     0     1    4      0  1
  text6001     3    0    9  1         0     6     0    1      0  1

So I think it looks good, but I believe that text6001, text5367, text5272, ... are the document names. My question is: are the rows in this matrix ordered, or are they placed in the matrix at random?

Thank you

EDIT:

I created a document-term frequency matrix:

mydfm <- dfm(df$Review, remove = stopwords("french"), stem = TRUE)

Then I created a tf-idf matrix:

tfidf <- tfidf(mydfm)[, 5:10]

Then I would like to merge the tf-idf matrix with the Rank column, to get something like this:

           features
Docs        good very film my excellent heart David plus always so Rank
  text14670    1    0    0  0         1     0     0    0      2  0    3
  text19951    3    0    0  0         0     0     0    1      1  1    2
  text24305    7    0    2  1         0     0     0    2      0  0    4
  text26985    6    0    0  0         0     0     0    4      0  1    5

Can you help me do this merge?

Thank you


Solution

  • The rows (documents) are alphabetically ordered, which is why text14670 comes before text19951. It is possible that the conversion has reordered the documents, but you can test this using

    sum(rownames(temp.tf) != sort(rownames(temp.tf)))
    

    If that is not 0, then they are not alphabetically ordered.

    The feature ordering, at least in the quanteda dfm, comes from the order in which the features are found in the texts. You can re-sort both documents and features using dfm_sort().

    In your code, the tokens(ngrams = 1:1) is unnecessary since dfm() does that and ngrams = 1 is the default.

    Also, do you need to convert this to a tm object? Probably most of what you need can be done in quanteda.
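
The ordering check described above can be tried on a small example in base R. This is only a sketch: doc_names is a hypothetical stand-in for rownames(temp.tf), with made-up document names taken from the sample output, and it assumes nothing about how quanteda or tm order documents internally. It also shows why text34547 appears before text3781 in an alphabetical listing: the names are compared as strings, not as numbers.

```r
# Hypothetical stand-in for rownames(temp.tf): a few document names,
# listed here in numeric (input) order rather than alphabetical order.
doc_names <- c("text3781", "text5272", "text14670", "text34547")

# Character sorting compares strings left to right, so "text34547"
# sorts before "text3781" even though 34547 > 3781.
sorted_names <- sort(doc_names)
print(sorted_names)

# Count the positions where the names differ from their sorted order.
# A count of 0 would mean the names are already alphabetically ordered.
n_out_of_place <- sum(doc_names != sorted_names)
print(n_out_of_place)

# Equivalent single check: TRUE means the vector is not in sorted order.
print(is.unsorted(doc_names))
```

If the count is 0 for your real row names, the documents are in alphabetical order; any other value means at least some rows are in a different order (for example, the original input order).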