I am trying to visualize the number of tokens per document in a document-feature matrix (dfm).
This is easy when the tokens are first created from the txt files: I can simply look at the table created in the Data environment. For instance, the column 'type' clearly shows the tokens for each document.
However, after tokenizing the documents I created the dfm and used the function dfm_trim() with the argument 'min_termfreq' to keep only the tokens that appear at least 15 times across all the documents in the dfm. Consequently, the number of tokens per document decreased.
I cannot figure out how to visualize the new values. Could you please help me out?
########################### Code example ##########################
library(quanteda)
library(magrittr)  # provides the %>% pipe used below
# create the corpus
PI3_CORPUS <- corpus(PI3)
# create the tokens
PI3_TOKENS <- tokens(PI3_CORPUS, remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  tokens_wordstem()
# create the document feature matrix
PI3_DFM <- dfm(PI3_TOKENS) %>%
  dfm_trim(min_termfreq = 15)
# I would like to see the number of tokens per document from the dfm
######################## End code example ##########################
I tried both ntoken() and ntype(). They work, but the result is too 'untidy': the output printed in the console is not clear.
data.frame(doc_id = docnames(PI3_DFM), ntoken = ntoken(PI3_DFM),
           row.names = NULL)
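
For a fuller picture, here is a minimal sketch of how the same data.frame idea could be sorted and plotted. Since PI3 is not available here, it uses the first five speeches of data_corpus_inaugural (shipped with quanteda) as a stand-in, and the names toks, dfmat and counts are made up for illustration. The stemming and number/symbol removal steps are left out to keep it short.

```r
library(quanteda)

# Stand-in for the PI3 corpus (assumption): five inaugural addresses
toks <- tokens(data_corpus_inaugural[1:5], remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))

# Keep only features that occur at least 15 times across all documents
dfmat <- dfm_trim(dfm(toks), min_termfreq = 15)

# One row per document, with the token count that remains after trimming
counts <- data.frame(doc_id = docnames(dfmat),
                     ntoken = ntoken(dfmat),
                     row.names = NULL)

# Sort so the longest documents come first, then print the tidy table
counts <- counts[order(counts$ntoken, decreasing = TRUE), ]
print(counts)

# Simple base-graphics bar chart of tokens per document
barplot(counts$ntoken, names.arg = counts$doc_id, las = 2,
        ylab = "Tokens after dfm_trim()")
```

With the real data, replacing dfmat by PI3_DFM in the data.frame() call should be enough. Calling View(counts) in RStudio would open the same table in the viewer instead of the console.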