I'm trying to bootstrap a stepwise regression in Stata and extract the bootstrapped coefficients. I have two separate ado files. sw_pbs is the command the user uses, which calls the helper command sw_pbs_simulator.
program define sw_pbs, rclass
syntax varlist, [reps(integer 100)]
simulate _b, reps(`reps') : sw_pbs_simulator `varlist'
end
program define sw_pbs_simulator, rclass
syntax varlist
local depvar : word 1 of `varlist'
local indepvar : list varlist - depvar
reg `depvar' `indepvar'
local rmse = e(rmse)
matrix b_matrix = e(b)'
gen col_of_ones = 1
mkmat `indepvar' col_of_ones, mat(x_matrix)
gen errs = rnormal(0, `rmse')
mkmat errs, mat(e_matrix)
matrix y = x_matrix * b_matrix + e_matrix
svmat y
sw reg y `indepvar', pr(0.10) pe(0.05)
drop col_of_ones errs y
end
The output is a data set of the bootstrapped coefficients. My problem is that the output seems to be dependent on the result of the first stepwise regression simulation. For example if I had the independent variables var1 var2 var3 var4 and the first stepwise simulation includes only var1 and var2 in the model, then only var1 and var2 will appear in subsequent models. If the first simulation includes var1 var2 and var3 then only var1 var2 and var3 will appear in subsequent models, assuming that they are significant (if not their coefficients will appear as dots).
For example, the incorrect output is featured below. The variables lweight, age, lbph, svi, gleason and pgg45 never appear if they do not appear in the first simulation.
_b_lweight _b_age _b_lbph _b_svi _b_lcp _b_gleason _b_pgg45 _b_lpsa
.4064831 .5390302
.2298697 .5591789
.2829061 .6279869
.5384691 .6027049
.3157105 .5523808
I want coefficients that are not included in the model to always appear as dots in the data set and I want subsequent simulations to not be seemingly dependent on the first simulation.
By using _b
as a short-cut, the first iteration defined which coefficients were to be stored by simulate
in all subsequent iterations. That is fine for most simulation programs, as those would use a fixed set of coefficients, but not what you want to use in combination with sw
. So I adapted the program to explicitly list the coefficients (possibly missing when not selected) that are to be stored.
I also changed your programs such that they will run faster by avoiding mkmat
and svmat
and replacing those computations with predict
and generate
. I also changed it to make it fit more with conventions in the Stata community that a command will only replace a dataset in memory after a user explicitly asks for it by specifying the clear
option. Finally I made sure that names of variables and scalars created in the program do not conflict with names already present in memory by using tempvar
and tempname
. These will also be automatically deleted when the program ends.
clear all
program define sw_pbs, rclass
syntax varlist, clear [reps(integer 100)]
gettoken depvar indepvar : varlist
foreach var of local indepvar {
local res "`res' `var'=r(`var')"
}
simulate `res', reps(`reps') : sw_pbs_simulator `varlist'
end
program define sw_pbs_simulator, rclass
syntax varlist
tempname rmse b
tempvar yhat y
gettoken depvar indepvar : varlist
reg `depvar' `indepvar'
scalar `rmse' = e(rmse)
predict double `yhat' if e(sample)
gen double `y' = `yhat' + rnormal(0, `rmse')
sw reg `y' `indepvar', pr(0.10) pe(0.05)
// start returning coefficients
matrix `b' = e(b)
local in : colnames `b'
local out : list indepvar - in
foreach var of local in {
return scalar `var' = _b[`var']
}
foreach var of local out {
return scalar `var' = .
}
end