
Use the Survey package to weight observations in stacked imputations


I am exploring model variable selection within imputed data.

One technique is to stack the imputations in long format (so that n observations across M imputed datasets become a single dataset of n x M rows) and fit a weighted regression that down-weights each row in proportion to the number of imputations. If we treated the stacked dataset as one large dataset, the standard errors would be too small.

I am trying to use the weights argument in svyglm to account for the stacked data, so that the resulting SEs reflect n observations rather than n x M observations.

To illustrate:

library(mice)

### create data
set.seed(42)
n <- 50
id <- 1:n
var1 <- rbinom(n,1,0.4)
var2 <- runif(n,30,80)
var3 <- rnorm(n, mean = 12, sd = 5)
var4 <- rnorm(n, mean = 100, sd = 20)
lin <- (var1 * var2) + var3                        # linear signal driving the outcome
prob <- (lin - min(lin)) / (max(lin) - min(lin))   # rescale to [0, 1] for rbinom()
outcome <- rbinom(n, 1, prob = prob)
data <- data.frame(id, var1, var2, var3, var4, outcome)

### Add missingness
data_miss <- ampute(data)         # run once just to obtain the default missingness patterns
patt <- data_miss$patterns
patt <- patt[2:5, ]               # keep the patterns that make var1-var4 incomplete (id and outcome stay observed)
data_miss <- ampute(data, patterns = patt)
data_miss <- data_miss$amp

## create 5 imputed datasets
nimp <- 5
imp <- mice(data_miss, m = nimp)


## Stack data
data_long <- complete(imp, action = "long")

## Generate model in stacked data (SEs will be too small)
modlong <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = data_long)
summary(modlong)

The stacked data give overly small SEs: we have made the dataset 5 times larger, so the SEs shrink by roughly a factor of sqrt(5).

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.906417   0.965090  -3.012   0.0026 ** 
var1         2.221053   0.311167   7.138 9.48e-13 ***
var2        -0.002543   0.010468  -0.243   0.8081    
var3         0.076955   0.032265   2.385   0.0171 *  
var4         0.006595   0.008031   0.821   0.4115   

Add weights

data_long$weight <- 1/nimp

library(survey)
des <- svydesign(ids = ~1, data = data_long, weights = ~weight)
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des)

summary(mod_svy)

The weighted regression gives SEs very similar to the unweighted stacked model:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.906417   1.036691  -2.804  0.00546 ** 
var1         2.221053   0.310906   7.144 1.03e-11 ***
var2        -0.002543   0.010547  -0.241  0.80967    
var3         0.076955   0.030955   2.486  0.01358 *  
var4         0.006595   0.008581   0.769  0.44288   

Adding rescale = FALSE (which, as I understand it, stops the weights being rescaled to sum to the sample size) doesn't change anything:

mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des, rescale = FALSE)

summary(mod_svy)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.906417   1.036688  -2.804  0.00546 ** 
var1         2.221053   0.310905   7.144 1.03e-11 ***
var2        -0.002543   0.010547  -0.241  0.80967    
var3         0.076955   0.030955   2.486  0.01358 *  
var4         0.006595   0.008581   0.769  0.44288 

I would have expected SEs similar to those obtained when running the model on a single imputed dataset:

## Assess SEs in single imputation
mod_singleimp <-  glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = complete(imp,1))
summary(mod_singleimp)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -2.679589   2.116806  -1.266  0.20556   
var1         2.476193   0.761195   3.253  0.00114 **
var2         0.014823   0.025350   0.585  0.55874   
var3         0.048940   0.072752   0.673  0.50114   
var4        -0.004551   0.017986  -0.253  0.80026  

All assistance greatly appreciated, as would pointers to other ways of achieving the same goal.

Alternative options

The psfmi package allows stepwise selection in multiply imputed datasets and pooling of the resulting models. However, it is computationally intensive and slow with large datasets, particularly if the process needs to be bootstrapped (e.g. during internal validation), hence the requirement for a less intensive stacking approach.
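
For reference, a rough sketch of what a psfmi call can look like on the stacked data created above -- the call follows the psfmi_lr() interface described in the package documentation, and the p.crit value is purely illustrative:

library(psfmi)

## psfmi expects the stacked data plus a numeric imputation indicator (1..nimp);
## mice's long format already carries this in the .imp column
data_long$.imp <- as.integer(as.character(data_long$.imp))

pool_lr <- psfmi_lr(data = data_long, nimp = nimp, impvar = ".imp",
                    formula = outcome ~ var1 + var2 + var3 + var4,
                    p.crit = 0.157, direction = "BW", method = "D1")

names(pool_lr)   # inspect for the pooled model and the predictors removed at each step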


Solution

  • Sorry, no, this isn't going to work.

    To handle stacked imputation data with weights you need frequency weights, so that a weight of 1/10 means you have 1/10 of an observation. With svydesign you specify sampling weights, so that a weight of 1/10 means you have a complete observation that represents 1/10 of a unit in the population. These will (and should) give different standard errors. Pretending you have frequency weights when you actually have imputations is a clever hack that avoids needing software that understands imputations, which is fine, but it isn't compatible with survey, which understands what it's doing and is doing something different.

    Currently, if you want to use svyglm with multiple imputations you need to compute the standard errors separately -- most conveniently with Rubin's rules via mitools::MIcombine, which is set up to work with the survey package (see the help for with.svyimputationList and withPV). A minimal sketch is shown after this answer.

    It might be worth putting in a feature request to the mitools or survey developers (with citations to examples) to allow for stacked analysis of imputations, but this isn't just a matter of adjusting the weights.
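
A minimal sketch of that mitools/survey workflow, reusing the imp and nimp objects created in the question (the pattern follows the with.svyimputationList help page):

library(survey)
library(mitools)

## wrap the completed datasets as an imputationList
imp_list <- imputationList(lapply(1:nimp, function(i) complete(imp, i)))

## one survey design per completed dataset (no clustering or stratification here,
## so survey will warn that equal-probability sampling is assumed)
des_list <- svydesign(ids = ~1, data = imp_list)

## fit the model in each completed dataset, then pool with Rubin's rules
fits <- with(des_list, svyglm(outcome ~ var1 + var2 + var3 + var4,
                              family = quasibinomial()))
summary(MIcombine(fits))

The pooled SEs include both within- and between-imputation variance, so they will typically be somewhat larger than those from any single completed dataset. For a non-survey analysis, with(imp, glm(...)) followed by mice::pool() performs the same Rubin's-rules pooling.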