Search code examples
rnlptmquantedastatistics-bootstrap

How can I bootstrap text readability statistics using quanteda?


I'm new to both bootstrapping and the quanteda package for text analysis. I have a large corpus of texts organized by document group type that I'd like to obtain readability scores for. I can easily obtain readability scores for each group with the following function:

textstat_readability(texts(mwe, groups = "document"), "Flesch")

I then want to bootstrap the results to obtain a 95% confidence interval by wrapping a function:

b_readability <- function(x, i, groups = NULL, measure = "Flesch")
textstat_readability(texts(x[i], groups = groups), measure) 
n <- 10

groups <- factor(mwe[["document"]]$document)  
b <- boot(texts(mwe), b_readability, strata = groups, R = n, groups = groups) 
colnames(b$t) <- names(b$t0)
apply(b$t, 2, quantile, c(.025, .5, .975)) 

But "b <-" fails with the error: "Error in t.star[r, ] <- res[[r]] : incorrect number of subscripts on matrix"

I've wasted two days trying to debug with no luck. What am I doing wrong? Much appreciated for any advice...

MWE:

mwe<-structure(list(document = structure(c(1L, 1L), 
.Label = c("a", "b", "c", "d", "e"), class = "factor"),  text = c("Text 1. Text 1.1", "Text 2."), section = structure(2:1, .Label = c("aa", "bb", "cc", "dd", "ee", "ff", "hh", "ii", "jj", "kk"), class = "factor"), year = c(1919L, 1944L), preamble = structure(8:9, .Label = c("1", "2","3", " 4 ", "5", "6  ",  "7  ",  "8  ", "9  ",  "10 "), class = "factor"), articles = c(43L, 47L), pages = c(5.218, 7.666), wordcount = c(3503L, 4929L), mean_articles = c(45, 45)), row.names = 1:2, class = "data.frame")

mwe <- corpus(mwe)

b_readability <- function(x, i, groups = NULL, measure = "Flesch")
textstat_readability(texts(x[i], groups = groups), measure) 
n <- 10

groups <- factor(mwe[["document"]]$document)  
b <- boot(texts(mwe), b_readability, strata = groups, R = n, groups = groups) 
colnames(b$t) <- names(b$t0)
apply(b$t, 2, quantile, c(.025, .5, .975)) 

Solution

  • A good question that involves knowing a lot about the boot package as well as how to index and group corpus texts in quanteda. Here's the best (currently) and safest way to do it. "Safest" here means future-proof, since there are some things that currently work in the internal addressing of a quanteda corpus that will not work in upcoming v2. (We warn about this very clearly in ?corpus but no one seems to heed that warning...) Note also that while this should always work, we are also planning, in future versions, more direct methods for bootstrapping textual statistics that would not require the user to do this sort of deep dive into the boot package.

    Let's try a reproducible example from built-in objects first. To "bootstrap" a text, we will construct a new, hypothetical text using sentence-level resampling (with replacement) from the original, and use texts(x, groups = "<groupvar>") to piece this together into a hypothetical kind of text. (This is how I have done in in the two references at the end of this post.) To make this happen, we can exploit the property of texts() that it works to get texts from a corpus object but also works on character objects (but with fast grouping).

    To get the sentences, after subsetting the corpus to simplify our example here, we reshape it into sentences.

    First, however, I recorded the original document's name in a new document variable, so that we can use this for grouping later. In this example, we could also have used Year, but doing it this way will work for any example. (There are some internal records about the original docname that we might have used, but doing it this way will be future-proof.)

    library("quanteda")
    ## Package version: 1.4.1
    library("boot")
    
    docvars(data_corpus_inaugural, "docnameorig") <- docnames(data_corpus_inaugural)
    sent_corpus <- data_corpus_inaugural %>%
      corpus_subset(Year > 2000) %>%
      corpus_reshape(to = "sentences")
    

    Then we have to define the function to be bootstrapped. We will use the "index" method and call the index i (as you did above). Here, x will be a character and not a corpus, even though we will call texts() on it, again, using the grouping variable to reassemble it. This will also need to return a vector and not a data.frame, which is normal form of a textstat_*() return. So we will extract just the measure column and return it as a vector.

    b_readability <- function(x, i, groups = NULL, measure = "Flesch") {
      textstat_readability(texts(x[i], groups = groups[i]), measure)[[measure]]
    }
    

    We will call our grouping variable simgroups just to distinguish the value from the argument name, and use this for both the groups argument and for strata in the call to boot(). The strata is an argument to boot(), while groups is passed through to our function b_readability(). We need to factorize this grouping variable since the function seems to want that. Then we call boot() and get our answer.

    simgroups <- factor(docvars(sent_corpus, "docnameorig"))
    
    boot(texts(sent_corpus), b_readability, R = 10, 
         strata = simgroups, groups = simgroups)
    ## 
    ## STRATIFIED BOOTSTRAP
    ## 
    ## 
    ## Call:
    ## boot(data = texts(sent_corpus), statistic = b_readability, R = 10, 
    ##     strata = simgroups, groups = simgroups)
    ## 
    ## 
    ## Bootstrap Statistics :
    ##     original      bias    std. error
    ## t1* 60.22723 -0.01454477    2.457416
    ## t2* 53.23332  1.24942328    2.564719
    ## t3* 60.56705  1.07426297    1.996705
    ## t4* 53.55532 -0.28971190    1.943986
    ## t5* 58.63471  0.52289051    2.502101
    

    These correspond to the five (original) documents, here distinguished by year, although unfortunately those names have been replaced by t1, t2, ... in the return object from boot().

    To return to your original example, let's say these form two documents from one strata (since these are too short two subdivide further). Then:

    simgroups <- rep(1, ndoc(mwe))
    boot(texts(mwe), b_readability, R = 10, strata = simgroups, groups = simgroups)
    ## 
    ## STRATIFIED BOOTSTRAP
    ## 
    ## 
    ## Call:
    ## boot(data = texts(mwe), statistic = b_readability, R = 10, strata = simgroups, 
    ##     groups = simgroups)
    ## 
    ## 
    ## Bootstrap Statistics :
    ##     original    bias    std. error
    ## t1*   119.19 0.6428333   0.4902916