I am trying to create a data.table similar to the output of quanteda::textstat_frequency, but with one more column, docnames, which is a string of the names of the documents that contain the specific token.
E.g.:
a_corpus <- quanteda::corpus(c("some corpus text of no consequence that in practice is going to be very large",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
"adding some corpus text word repeats to ensure ngrams top ten selection approaches are working"))
ngrams_dfm <- quanteda::dfm(a_corpus, tolower = TRUE, stem = FALSE, ngrams = 2)
freq <- quanteda::textstat_frequency(ngrams_dfm)
# freq's header has feature, frequency, rank, docfreq, group
data.table(feature = featnames(ngrams_dfm)[1:50],
           frequency = colSums(ngrams_dfm)[1:50],
           doc_names = paste(docnames, collapse = ',')?, # what should be here?
           keep.rownames = FALSE,
           stringsAsFactors = FALSE)
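One possible way to fill in the doc_names column is to test, per feature, which documents have a positive count and paste their names together. The sketch below uses a small hand-made count matrix as a stand-in for the dfm, so the document names, feature names and counts are all made up for illustration. On the real object the same idea would presumably start from as.matrix(ngrams_dfm), which is an assumption and which densifies the matrix, so it may be too expensive for a very large corpus.

```r
# Toy document-feature count matrix standing in for as.matrix(ngrams_dfm)
# (hypothetical data: rows are documents, columns are features)
counts <- matrix(c(1, 0, 1,
                   0, 2, 1,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("text1", "text2", "text3"),
                                 c("some_corpus", "very_large", "top_ten")))

# For each feature (column), collect the names of the documents (rows)
# in which its count is positive
doc_names <- apply(counts > 0, 2, function(present) {
  paste(rownames(counts)[present], collapse = ",")
})

# Assemble a frequency table with the extra doc_names column
result <- data.frame(feature = colnames(counts),
                     frequency = colSums(counts),
                     doc_names = doc_names,
                     row.names = NULL,
                     stringsAsFactors = FALSE)
print(result)
```

The result has one row per feature, with its total count and a comma-separated string of the documents that contain it.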
Another (opinionated) approach is to use the udpipe R package, as in the example below. Its advantage is that you can easily select on part-of-speech tags, or use it to select specific dependency parsing results, which is so much better than bigrams (but that's for another question).
library(udpipe)
library(data.table)
txt <- c("some corpus text of no consequence that in practice is going to be very large",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
"adding some corpus text word repeats to ensure ngrams top ten selection approaches are working")
x <- udpipe(txt, "english", trace = TRUE) ## rich output, but takes a while for large volumes of text
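x <- setDT(x)  # convert the annotation to a data.table
# Bigrams of lemmas and the next token's POS tag, computed within each sentence
x <- x[, bigram_lemma := txt_nextgram(lemma, n = 2, sep = "-"), by = list(doc_id, paragraph_id, sentence_id)]
x <- x[, upos_next := txt_next(upos, n = 1), by = list(doc_id, paragraph_id, sentence_id)]
# Keep only adjectives that are directly followed by a noun
x_nouns <- subset(x, upos %in% c("ADJ") & upos_next %in% c("NOUN"))
View(x)
# Count bigram and unigram lemmas per document, then cast to a sparse document-term matrix
freqs <- document_term_frequencies(x, document = "doc_id", term = c("bigram_lemma", "lemma"))
dtm <- document_term_matrix(freqs)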
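To get back to the doc_names column asked for in the question, the long-format frequencies can be aggregated per term with data.table. This is only a sketch: freqs_toy is a hand-made stand-in (its values are invented) that assumes freqs has the columns doc_id, term and freq, and term_docs is just an illustrative name.

```r
library(data.table)

# Hypothetical stand-in for `freqs`: one row per document/term pair
freqs_toy <- data.table(
  doc_id = c("doc1", "doc1", "doc2", "doc3", "doc3"),
  term   = c("corpus-text", "very-large", "very-large", "corpus-text", "top-ten"),
  freq   = c(1L, 1L, 1L, 1L, 1L)
)

# Per term: total frequency, number of documents, and the document names
term_docs <- freqs_toy[, .(frequency = sum(freq),
                           docfreq   = uniqueN(doc_id),
                           doc_names = paste(unique(doc_id), collapse = ",")),
                       by = term]

# Most frequent terms first, ties broken alphabetically
setorder(term_docs, -frequency, term)
print(term_docs)
```

With the real data, the same aggregation would be run on freqs instead of freqs_toy, and head(term_docs, 10) would then give the top ten terms along with the documents they occur in.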