Bootstrapping Stepwise Regression in Stata

I'm trying to bootstrap a stepwise regression in Stata and extract the bootstrapped coefficients. I have two separate ado files. sw_pbs is the command the user uses, which calls the helper command sw_pbs_simulator.

program define sw_pbs, rclass
    syntax varlist, [reps(integer 100)]
    simulate _b, reps(`reps') : sw_pbs_simulator `varlist'
end


program define sw_pbs_simulator, rclass
    syntax varlist

    local depvar : word 1 of `varlist'
    local indepvar : list  varlist - depvar
    reg `depvar' `indepvar'
    local rmse = e(rmse)
    matrix b_matrix = e(b)'
    gen col_of_ones = 1
    mkmat `indepvar' col_of_ones, mat(x_matrix)
    gen errs = rnormal(0, `rmse')
    mkmat errs, mat(e_matrix)
    matrix y = x_matrix * b_matrix + e_matrix
    svmat y
    sw reg y `indepvar', pr(0.10) pe(0.05)
    drop col_of_ones errs y

end

The output is a data set of the bootstrapped coefficients. My problem is that the output seems to be dependent on the result of the first stepwise regression simulation. For example if I had the independent variables var1 var2 var3 var4 and the first stepwise simulation includes only var1 and var2 in the model, then only var1 and var2 will appear in subsequent models. If the first simulation includes var1 var2 and var3 then only var1 var2 and var3 will appear in subsequent models, assuming that they are significant (if not their coefficients will appear as dots).

For example, the incorrect output is featured below. The variables lweight, age, lbph, svi, gleason and pgg45 never appear if they do not appear in the first simulation.

_b_lweight  _b_age  _b_lbph _b_svi  _b_lcp  _b_gleason  _b_pgg45    _b_lpsa
                       .4064831                        .5390302
                       .2298697                        .5591789
                       .2829061                        .6279869
                       .5384691                        .6027049
                       .3157105                        .5523808

I want coefficients that are not included in the model to always appear as dots in the data set and I want subsequent simulations to not be seemingly dependent on the first simulation.

Solution

By using _b as a short-cut, the first iteration defined which coefficients were to be stored by simulate in all subsequent iterations. That is fine for most simulation programs, as those would use a fixed set of coefficients, but not what you want to use in combination with sw. So I adapted the program to explicitly list the coefficients (possibly missing when not selected) that are to be stored.

I also changed your programs such that they will run faster by avoiding mkmat and svmat and replacing those computations with predict and generate. I also changed it to make it fit more with conventions in the Stata community that a command will only replace a dataset in memory after a user explicitly asks for it by specifying the clear option. Finally I made sure that names of variables and scalars created in the program do not conflict with names already present in memory by using tempvar and tempname. These will also be automatically deleted when the program ends.

clear all
program define sw_pbs, rclass
    syntax varlist, clear [reps(integer 100)]

    gettoken depvar indepvar : varlist
    foreach var of local indepvar {
        local res "`res' `var'=r(`var')"
    }

    simulate `res', reps(`reps') : sw_pbs_simulator `varlist'
end

program define sw_pbs_simulator, rclass
    syntax varlist
    tempname rmse b
    tempvar yhat y
    gettoken depvar indepvar : varlist
    reg `depvar' `indepvar'
    scalar `rmse' = e(rmse)
    predict double `yhat' if e(sample) 
    gen double `y' = `yhat' +  rnormal(0, `rmse') 
    sw reg `y' `indepvar', pr(0.10) pe(0.05)

    // start returning coefficients
    matrix `b' = e(b)
    local in : colnames `b'
    local out : list indepvar - in
    foreach var of local in {
        return scalar `var' = _b[`var']
    }
    foreach var of local out {
        return scalar `var' = .
    }
end