Referring ngrams back to document names in quanteda

I am trying to create a data.table similar to the output of quanteda::textstat_frequency, but with one more column, doc_names, holding a comma-separated string of the names of the documents that contain the given feature. For example:

a_corpus <- quanteda::corpus(c("some corpus text of no consequence that in practice is going to be very large",
                                   "and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
                                   "adding some corpus text word repeats to ensure ngrams top ten selection approaches are working"))

ngrams_dfm <- quanteda::dfm(a_corpus, tolower = T, stem = F, ngrams = 2)
freq = textstat_frequency(ngrams_dfm)
# freq's header has feature, frequency, rank, docfreq, group

data.table(feature = featnames(ngrams_dfm)[1:50],
           frequency = colSums(ngrams_dfm)[1:50],
           doc_names = paste(docnames, collapse = ',')?, # what should be here?
           keep.rownames = F,
           stringsAsFactors = F)
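
One possible way to fill in the doc_names column is to look, for each feature, at which rows of the dfm have a non-zero count. The sketch below assumes quanteda 3 or later, where bigrams are built with tokens_ngrams() rather than the ngrams argument of dfm(). It also converts the dfm to a dense matrix, which is only reasonable for small inputs. The names m, doc_names and res are illustrative, not part of any API.

```r
library(quanteda)
library(data.table)

txt <- c("some corpus text of no consequence that in practice is going to be very large",
         "and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
         "adding some corpus text word repeats to ensure ngrams top ten selection approaches are working")

# Build bigrams on the tokens, then the document-feature matrix
toks <- tokens_ngrams(tokens(txt), n = 2)
ngrams_dfm <- dfm(toks, tolower = TRUE)

# Dense matrix: rows are documents, columns are bigram features
# (assumption: the input is small enough for this to fit in memory)
m <- as.matrix(ngrams_dfm)

# For each feature, paste together the names of the documents where it occurs
doc_names <- apply(m > 0, 2, function(present) {
  paste(rownames(m)[present], collapse = ",")
})

res <- data.table(feature   = colnames(m),
                  frequency = colSums(m),
                  docfreq   = colSums(m > 0),
                  doc_names = doc_names)
res <- res[order(-frequency)]
head(res, 10)
```

For a very large corpus the dense conversion would likely be the bottleneck, so working on a sparse or long-format representation of the counts may be preferable.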


  • Another (opinionated) approach is to use the udpipe R package. In the example below, it has the advantage of making it easy to select terms based on part-of-speech tags. You could also use it to select specific dependency parsing results, which is much better than bigrams (but that's for another question).

    library(udpipe)
    library(data.table)

    txt <- c("some corpus text of no consequence that in practice is going to be very large",
           "and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
           "adding some corpus text word repeats to ensure ngrams top ten selection approaches are working")
    ## tokenise, tag and parse: rich output, but takes a while for large volumes of text
    x <- udpipe(txt, "english", trace = TRUE)
    x <- setDT(x)
    ## bigrams of lemmas, built within each sentence
    x <- x[, bigram_lemma := txt_nextgram(lemma, n = 2, sep = "-"), by = list(doc_id, paragraph_id, sentence_id)]
    ## part-of-speech tag of the following token
    x <- x[, upos_next := txt_next(upos, n = 1), by = list(doc_id, paragraph_id, sentence_id)]
    ## adjective followed by a noun; note the frequencies below are computed on x, not on this subset
    x_nouns <- subset(x, upos %in% c("ADJ") & upos_next %in% c("NOUN"))
    ## term frequencies per document, then a sparse document-term matrix
    freqs <- document_term_frequencies(x, document = "doc_id", term = c("bigram_lemma", "lemma"))
    dtm <- document_term_matrix(freqs)
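
To get from there to the doc_names column asked for in the question, the long-format frequencies can be aggregated by term. This is a minimal sketch that assumes freqs has the columns doc_id, term and freq, as document_term_frequencies returns. To keep it runnable without downloading a udpipe model, it uses a small hand-made stand-in, freqs_demo, whose values are hypothetical.

```r
library(data.table)

# Hypothetical stand-in with the same columns as `freqs` (doc_id, term, freq)
freqs_demo <- data.table(
  doc_id = c("doc1", "doc1", "doc2", "doc3", "doc3"),
  term   = c("some-corpus", "very-large", "very-large", "some-corpus", "top-ten"),
  freq   = c(1L, 1L, 1L, 1L, 1L)
)

# One row per term: total frequency, number of documents, and their names
by_term <- freqs_demo[, .(frequency = sum(freq),
                          docfreq   = uniqueN(doc_id),
                          doc_names = paste(unique(doc_id), collapse = ",")),
                      by = term][order(-frequency, term)]
by_term
```

Replacing freqs_demo with freqs should give the same kind of table for the real data, since the aggregation only relies on those three columns.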