Search code examples
rloopsglmsamplestatistics-bootstrap

Bootstrapping with glm model


I have a negative binomial regression model where I predict Twitter messages' retweet count based on their use of certain word types (ME words, Moral words, and Emotional words):

M1 <- glm.nb(retweetCount ~ ME_words + Moral_words + Emo_words, data = Tweets)

I now want to sample with bootstrapping (for instance, samples of 1,000 with replacement from the dataframe's original 500,000 messages) from the large dataset, Tweets, to run iterations of the model and analyse the variance of the coefficients. What is the best approach to doing this? I am assuming I need to use the boot package, but I am a bit lost with where to start.

Ideally, I would like to create a for loop that can run a number of iterations, and then store the coefficients of each model iteration in a new dataframe. This would be extremely useful for future analyses.


Here is some reproducible data from the much much large dataframe Tweets:

>dput((head(Tweets, 100)))

structure(list(retweetCount = c(1388, 762, 748, 436, 342, 320, 
312, 295, 264, 251, 196, 190, 175, 167, 165, 163, 149, 148, 148, 
146, 133, 132, 126, 124, 122, 122, 121, 120, 118, 118, 114, 113, 
112, 110, 108, 107, 104, 101, 100, 96, 95, 94, 93, 92, 90, 90, 
89, 89, 87, 86, 84, 83, 83, 83, 82, 82, 82, 82, 78, 78, 78, 76, 
76, 76, 76, 74, 74, 73, 73, 72, 72, 71, 70, 70, 70, 70, 69, 69, 
69, 68, 68, 67, 65, 65, 65, 65, 63, 62, 62, 61, 61, 61, 61, 60, 
60, 59, 59, 59, 59, 58), ME_words = c(2, 2, 2, 0, 0, 1, 1, 0, 
1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 
0, 3, 0, 1, 0, 1, 1, 4, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 
0, 0, 2, 2, 0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 1, 0, 0, 
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 1, 0, 0, 0, 1, 0, 0), Moral_words = c(0, 0, 1, 1, 1, 2, 0, 
0, 0, 1, 0, 1, 2, 0, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 2, 0, 1, 1, 
1, 0, 1, 1, 1, 1, 0, 2, 0, 1, 1, 1, 2, 0, 1, 1, 1, 1, 0, 1, 0, 
0, 5, 1, 1, 1, 1, 2, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 
1, 1, 2, 0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 
1, 1, 0, 0, 2, 2, 1, 0, 0), Emo_words = c(0, 0, 1, 1, 0, 0, 2, 
0, 1, 0, 2, 0, 1, 0, 1, 2, 0, 3, 1, 1, 2, 0, 0, 0, 0, 0, 1, 1, 
1, 2, 0, 1, 0, 0, 0, 1, 0, 1, 0, 2, 0, 0, 1, 0, 1, 1, 2, 0, 0, 
1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 3, 0, 0, 2, 0, 0, 1, 0, 
1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 2, 2, 1, 0, 0, 0, 0, 2, 1, 0, 0, 
1, 0, 0, 1, 2, 2, 0, 0, 0)), row.names = c(NA, -100L), class = c("tbl_df", 
"tbl", "data.frame"))

Solution

  • You can use the boot package, but for simple versions of the bootstrap it's almost simpler to roll your own.

    fit initial model

    library(MASS)
    M1 <- glm.nb(retweetCount ~ ME_words + Moral_words + 
                     Emo_words, data = Tweets)
    

    set up data structure for results

    nboot <- 1000
    bres <- matrix(NA,nrow=nboot,
                      ncol=length(coef(M1)),
                      dimnames=list(rep=seq(nboot),
                                    coef=names(coef(M1))))
    

    bootstrap

    set.seed(101)
    bootsize <- 200
    for (i in seq(nboot)) {
      bdat <- Tweets[sample(nrow(Tweets),size=bootsize,replace=TRUE),]
      bfit <- update(M1, data=bdat)  ## refit with new data
      bres[i,] <- coef(bfit)
    }
    

    structure output

    data.frame(mean_est=colMeans(bres),
          t(apply(bres,2,quantile,c(0.025,0.975))))