Search code examples
rregressionplmresample

Bootstrap panel data: randomly selecting individuals, not the person-waves


I am trying to replicate the Stata code (available here) in R. The idea is randomly selecting bootstrap sample from the idcode, not the idcode-year and then estimate the parameters from the selected samples and save them in an Stata datafile. Cluster and idcluster options in the Stata code are used to force Stata to randomly select the bootstrap samples from the idcodes, not the idcode-year.

The below R code does the sampling part per @PBulls suggestion. I wonder how to modify the R code so it replicates the Stata code also pasted below.

Here is the R code:

data <- haven::read_dta("http://www.stata-press.com/data/r9/nlswork.dta") |>
  dplyr::filter(!is.na(ttl_exp) & !is.na(hours))

panelboot <- function(df) {
  ids <- unique(df[["idcode"]])
  
  data.frame(
    idcode = sample(ids, replace=TRUE),
    nidcode = seq_along(ids)
  ) |> dplyr::left_join(df, by = "idcode", relationship = "many-to-many")
}

set.seed(1)
boot1 <- panelboot(data)

## ID 1130 is selected 3 times total, contributing 8 observations each time
fid <- which(boot1[,1] == boot1[1,1])

## The resamples of ID 1130 have new IDs 1, 933, 1671
boot1[fid,2]

Here is the Stata code and its output:

use "nslwork.dta", clear 
*you need below for bootstraping from observation not person-years
tsset, clear
generate long newid = idcode // long is not about the data format but it refers to the variable format 
tsset newid year

capture program drop savemargins
program savemargins, rclass
reg ttl_exp hours
end 
bootstrap _b,  saving(boot_output.dta, replace ) reps(10) cluster(idcode) idcluster(newid) : savemargins 

Here is Stata output:

Bootstrap replications (10): .........10 done

Linear regression                                       Number of obs = 28,467
                                                        Replications  =     10
                                                        Wald chi2(1)  = 194.08
                                                        Prob > chi2   = 0.0000
                                                        R-squared     = 0.0118
                                                        Adj R-squared = 0.0118
                                                        Root MSE      = 4.6267

                              (Replications based on 4,710 clusters in idcode)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     ttl_exp | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       hours |   .0512427   .0036782    13.93   0.000     .0440335    .0584519
       _cons |   4.343879   .1302096    33.36   0.000     4.088673    4.599085
------------------------------------------------------------------------------

Solution

  • From the comments, here's code that will perform cluster resampling & perform a "pooled OLS":

    N_SAMPLES <- 100
    
    df <- haven::read_dta("http://www.stata-press.com/data/r9/nlswork.dta") |>
      dplyr::filter(!is.na(ttl_exp) & !is.na(hours))
     
    panelboot <- function(df) {
      ids <- unique(df[["idcode"]])
     
      data.frame(
        idcode = sample(ids, replace=TRUE),
        nidcode = seq_along(ids)
      ) |> dplyr::left_join(df, by = "idcode", relationship = "many-to-many")
    }
     
    reg <- function(df, id="nidcode") {
      coef(plm::plm(ttl_exp ~ hours, data = df, model="pooling", index=c(id, "year")))
    }
     
    set.seed(1)
    t0 <- reg(df, id="idcode")
    t <- t(vapply(seq_len(N_SAMPLES), \(d) reg(panelboot(df)), numeric(2)))
    boot_se <- matrixStats::colSds(t)
     
    cbind(t0, boot_se)
    #>               Estimate  SE
    #> (Intercept)   4.343879  0.141799
    #> hours         0.051243  0.003935
     
    t(vapply(seq_along(t0), \(i) boot::norm.ci(t0=t0[i], t=t[,i])[2:3], numeric(2)))
    #>               Normal-based 95% CI
    #> (Intercept)   4.044749  4.627206
    #> hours         0.043783  0.059056
    

    A few pointers to some other questions that came up:

    • The number of bootstrap samples is set by the length of the object in the first vapply call, I've moved this to the top.
    • The bootstrap draws themselves are being saved as the object t (in line with boot's parameter naming). This is an N*P matrix, with N the number of bootstrap samples (rows) and P the number of bootstrapped parameters (columns). I subsequently calculate the standard deviations of each column (matrixStats::colSds). You were previously filling and possibly overwriting this object very differently, in that case you may have to calculate row-wise SDs or something else entirely.