We are using burstiness for terminology/lexicon induction from text corpora.
We have currently implemented an R script based on one of the Burstiness Similarity formulas described in Section 2.6 of the following article: Ann Irvine and Chris Callison-Burch (2017). A Comprehensive Analysis of Bilingual Lexicon Induction. Computational Linguistics, 43(2):273–310. https://www.mitpressjournals.org/doi/full/10.1162/COLI_a_00284
As far as I know, Katz was one of the first researchers to use the concept of burstiness for language modelling (see Justeson, J. S. and Katz, S. M. (1995). Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1:9–27; Katz, S. (1996). Distribution of content words and phrases in text and language modelling. Natural Language Engineering, 2(1):15–60).
We would like to use off-the-shelf burstiness implementations for comparison and for the evaluation of our script.
I would like to know whether there are R packages or functions that identify bursty words in text corpora. I would be particularly interested in any solution based on or leveraging Quanteda, since Quanteda is an extremely versatile package for text statistics.
The only R package that I have found so far is ‘bursts’ (February 19, 2015), which implements Kleinberg's burst detection algorithm. That algorithm "identifies time periods in which a target event is uncharacteristically frequent, or 'bursty'." This is not what I need, since that approach is based on time series.
Any help, suggestions, or references are appreciated.
Cheers, Marina
I haven't found many public references about burstiness in text analysis. I did come across Modeling Statistical Properties of Written Text.
If I'm reading the formula in section 2.6 of the article you supplied correctly, it is the relative proportion of a word divided by the proportion of documents in which the word appears.
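Under that reading, a word that makes up, say, 5% of a document's tokens and appears in 2 of 10 documents would score 0.05 / (2/10) = 0.25 (the numbers here are made up purely for illustration).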
I had hoped that the `dfm_tfidf` function would get me there, but the `scheme_df` argument of that function does not have a proportional document frequency option.
But we can use parts of quanteda's existing functions to put everything together.
Let's assume we have a document-feature matrix (`dfm`) called `docfm`. Then the steps are as follows:

- The relative proportion of the terms can be calculated with `dfm_weight(docfm, scheme = "prop")`.
- The proportional document frequency is `docfreq(docfm) / ndoc(docfm)`.
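To make these two building blocks concrete, here is a minimal sketch with a made-up two-document corpus (the object name `docfm` matches the assumption above):

library(quanteda)

# Toy corpus, purely for illustration
docfm <- dfm(tokens(c(d1 = "x x y", d2 = "y z z")))

dfm_weight(docfm, scheme = "prop")  # relative proportions, e.g. "x" in d1 is 2/3
docfreq(docfm) / ndoc(docfm)        # proportional document frequency, e.g. "y" is 2/2 = 1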
Now some matrix division. Either `apply` or `sweep` will work: `apply` returns a matrix and needs to be transposed, while `sweep` returns a dgeMatrix. In both cases you can turn the result back into a `dfm` with `as.dfm`. Unfortunately both are dense matrices, so you might need to take that into account for large corpora. Putting it all together:
Using `apply`:
t(apply(X = dfm_weight(docfm, scheme = "prop"), 1, "/", (docfreq(docfm) / ndoc(docfm))))
Using `sweep`:
sweep(dfm_weight(docfm, scheme = "prop"), MARGIN = 2, STATS = docfreq(docfm) / ndoc(docfm), FUN = "/")
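And a minimal end-to-end sketch, again with a made-up toy corpus, showing that both routes agree once converted back with `as.dfm`:

library(quanteda)

# Toy corpus, purely for illustration
docfm <- dfm(tokens(c(d1 = "a a b c", d2 = "a b b d", d3 = "c c c d")))

props    <- dfm_weight(docfm, scheme = "prop")  # per-document relative frequencies
doc_prop <- docfreq(docfm) / ndoc(docfm)        # proportional document frequencies

# sweep() route, converted back to a dfm
burst_sweep <- as.dfm(sweep(props, MARGIN = 2, STATS = doc_prop, FUN = "/"))

# apply() route, transposed and converted back to a dfm
burst_apply <- as.dfm(t(apply(props, 1, "/", doc_prop)))

all.equal(as.matrix(burst_sweep), as.matrix(burst_apply))  # should be TRUE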