r  token  quanteda  dfm

Number of tokens per document in a document-feature matrix (dfm)


I am trying to view the number of tokens per document in a document-feature matrix (dfm).

This is easy when the tokens are first created from the txt files: I can inspect the tokens object in the RStudio environment pane, where the column "type" clearly shows the tokens for each document.

However, after tokenizing the documents I created the dfm and used dfm_trim() with the argument 'min_termfreq' to keep only the tokens that appear at least 15 times across all documents in the dfm. As a result, the number of tokens per document decreased.

I cannot figure out how to display the new values. Could you please help me out?

########################### Code example ##########################

# load quanteda; magrittr supplies the %>% pipe used below
library(quanteda)
library(magrittr)

# create the corpus
PI3_CORPUS <- corpus(PI3)

# create the tokens
PI3_TOKENS <- tokens(PI3_CORPUS, remove_punct = TRUE, 
                     remove_numbers = TRUE, 
                     remove_symbols = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  tokens_wordstem()

# create the document feature matrix
PI3_DFM <- dfm(PI3_TOKENS) %>%
  dfm_trim(min_termfreq = 15)

# I would like to see the number of tokens per document from the dfm

######################## End code example ##########################

I tried both ntoken() and ntype(). They work, but the output is too 'untidy': the way the values are printed in the console is hard to read.


Solution

    data.frame(doc_id = docnames(PI3_DFM), ntoken = ntoken(PI3_DFM),
               row.names = NULL)
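
For anyone who wants to try this without the original data, here is a minimal sketch on a made-up three-document corpus. The texts, the object names toy_txt, toy_dfm and counts, and the threshold of 2 are all assumptions for illustration; they are not from the question. The sketch also adds an ntype column, since the question mentions ntype() as well.

```r
library(quanteda)

# Toy texts standing in for PI3 (hypothetical data, not from the question)
toy_txt <- c(doc1 = "apple banana apple cherry",
             doc2 = "banana banana cherry",
             doc3 = "apple date")

# Build the dfm, then keep only features occurring at least twice overall
# (a small threshold chosen to suit the toy data)
toy_dfm <- dfm_trim(dfm(tokens(toy_txt)), min_termfreq = 2)

# ntoken() and ntype() return named vectors; putting them in a data frame
# with row.names = NULL gives one tidy row per document
counts <- data.frame(doc_id = docnames(toy_dfm),
                     ntoken = ntoken(toy_dfm),
                     ntype  = ntype(toy_dfm),
                     row.names = NULL)

print(counts)
```

Because the counts are taken from the trimmed dfm, they reflect only the features that survived dfm_trim(). The resulting data frame can be sorted or filtered like any other, for example with counts[order(-counts$ntoken), ].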