I am exploring model variable selection within imputed data.
One technique is to stack the imputations in long format (so that n observations in M imputed datasets become a dataset of n x M rows) and use weighted regression, downweighting each observation in proportion to the number of imputations. If the stacked dataset were treated as one large dataset, the standard errors would be too small.
I am trying to use the weights argument in svyglm to account for the stacked data, so that the resulting SEs are those you would expect with n observations rather than n x M observations.
To illustrate:
library(mice)
### create data
set.seed(42)
n <- 50
id <- 1:n
var1 <- rbinom(n,1,0.4)
var2 <- runif(n,30,80)
var3 <- rnorm(n, mean = 12, sd = 5)
var4 <- rnorm(n, mean = 100, sd = 20)
linpred <- var1*var2 + var3
prob <- (linpred - min(linpred)) / (max(linpred) - min(linpred))
outcome <- rbinom(n, 1, prob = prob)
data <- data.frame(id, var1, var2, var3, var4, outcome)
### Add missingness
data_miss <- ampute(data)
patt <- data_miss$patterns
patt <- patt[2:5,]   # keep only the patterns with missingness in var1-var4
data_miss <- ampute(data, patterns = patt)
data_miss <- data_miss$amp
## create 5 imputed datasets
nimp <- 5
imp <- mice(data_miss, m = nimp)
## Stack data
data_long <- complete(imp, action = "long")
## Generate model in stacked data (SEs will be too small)
modlong <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = data_long)
summary(modlong)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.906417 0.965090 -3.012 0.0026 **
var1 2.221053 0.311167 7.138 9.48e-13 ***
var2 -0.002543 0.010468 -0.243 0.8081
var3 0.076955 0.032265 2.385 0.0171 *
var4 0.006595 0.008031 0.821 0.4115
## Add a weight of 1/M to each row and refit using the survey package
data_long$weight <- 1/nimp
library(survey)
des <- svydesign(ids = ~1, data = data_long, weights = ~weight)
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des)
summary(mod_svy)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036691 -2.804 0.00546 **
var1 2.221053 0.310906 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
Using rescale = FALSE (to apparently stop the weights being rescaled to sum to the sample size) doesn't change anything:
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des, rescale = FALSE)
summary(mod_svy)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036688 -2.804 0.00546 **
var1 2.221053 0.310905 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
## Assess SEs in single imputation
mod_singleimp <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = complete(imp,1))
summary(mod_singleimp)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.679589 2.116806 -1.266 0.20556
var1 2.476193 0.761195 3.253 0.00114 **
var2 0.014823 0.025350 0.585 0.55874
var3 0.048940 0.072752 0.673 0.50114
var4 -0.004551 0.017986 -0.253 0.80026
All assistance greatly appreciated, as would pointers to other ways of achieving the same goal. The psfmi package allows for stepwise selection in multiply imputed datasets and pooling of models. However, it is computationally intensive and slow with large datasets, particularly if the process needs to be bootstrapped (e.g. during internal validation), hence the requirement for a less intensive stacking approach.
Sorry, no, this isn't going to work.
To handle stacked imputation data with weights you need frequency weights, so that a weight of 1/10 means you have 1/10 of an observation. With svydesign you specify sampling weights, so that a weight of 1/10 means your observation represents 10 observations in the population. These will (and should) give different standard errors. Pretending you have frequency weights when you actually have imputations is a clever hack to avoid having software that understands what it's doing, which is fine, but it isn't compatible with survey, which understands what it's doing and is doing something different.
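To make the distinction concrete, here is a minimal sketch using the data_long, weight and des objects from the question (an illustration of the two interpretations, not an endorsement of the hack). glm() treats its weights argument as prior weights that scale each row's contribution to the likelihood, so a weight of 1/M brings the information back to roughly n observations, whereas svyglm() treats the same column as sampling weights and uses a design-based (sandwich) variance estimator. The two sets of standard errors differ, as they should:

## Prior/frequency-style weighting in glm(): each stacked row contributes
## 1/M of an observation's information. R will warn about non-integer
## #successes in a binomial glm; that is expected here and can be ignored.
mod_freq <- glm(outcome ~ var1 + var2 + var3 + var4,
                family = binomial(), weights = weight, data = data_long)

## Sampling-weight interpretation in svyglm(): equal weights rescale away in
## the design-based variance, so the SEs stay close to the n x M scale.
mod_samp <- svyglm(outcome ~ var1 + var2 + var3 + var4,
                   family = quasibinomial(), design = des)

## Compare the standard errors under the two interpretations
cbind(freq_SE = summary(mod_freq)$coefficients[, "Std. Error"],
      sampling_SE = summary(mod_samp)$coefficients[, "Std. Error"])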
Currently, if you want to use svyglm with multiple imputations you need to compute the standard errors separately -- most conveniently with Rubin's rules using mitools::MIcombine, which is set up to work with the survey package (see the help for with.svyimputationList and withPV).
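For reference, here is a minimal sketch of that route, following the pattern in the mitools and survey documentation (the object names imp_list, des_mi and fits are just illustrative, and mice and survey are assumed to be loaded as in the question):

library(mitools)

## Wrap the m completed datasets from mice in an imputationList
imp_list <- imputationList(lapply(1:nimp, function(i) complete(imp, action = i)))

## One design object covering all imputations (a simple unweighted design,
## since the example data are not from a complex survey)
des_mi <- svydesign(ids = ~1, data = imp_list)

## Fit the same model in each completed dataset, then pool with Rubin's rules
fits <- with(des_mi, svyglm(outcome ~ var1 + var2 + var3 + var4,
                            family = quasibinomial()))
summary(MIcombine(fits))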
It might be worth putting in a feature request to the mitools or survey developers (with citations to examples) to allow for stacked analysis of imputations, but this isn't just a matter of adjusting the weights.