Search code examples
rpoolimputationr-miceseemingly-unrelated-regression

Seemingly unrelated regression in R with imputed data-Pooling results


I am trying to complete seemingly unrelated regressions (SUR) using the systemfit package in R. However, it is not straightforward to complete these analyses with multiply imputed data (with mice package).

Upon googling this question, I see that there was a deleted post about the identical question, which seems to have utilized the following example (credit to poster, minor edits)

library(systemfit)
library(mice)
nhanes2

r1 <- bmi ~ hyp 

r2 <- bmi ~ age

system <- list( r1, r2 )

imp <- mice(nhanes2, m = 5)
  m=5
  completed=lapply(1:5,function(i)complete(imp,i))
  fit.model <- systemfit(system, data= completed[[1]])

Above yields complete systemfit output for each imputed dataset. This is great but I am left with the task of pooling the entire output generated by SUR.

I have also unsuccessfully attempted running the function in zelig:

  completed.mi=do.call(Zelig:mi,completed)
  system=list(r1= bmi ~ hyp,r2=bmi~age)
  z.out=zelig(formula= system,model="sur",data=completed.mi)
  >Error: sur is not a supported model type.

Finally, calling the long form of the imputed data yields large degrees of freedom which does not reflect the actual number of cases in each imputed dataset (example not included)

My questions are:

1) Does the systemfit package support SUR for MI data?

2) If so, are you able to pool the output across all imputed datasets?

3) Are there alternative package options (other than systemfit) for completing SUR in R?

4) If 3 is a no, is there a similar analysis that will accomplish the same objectives and is there a different package(e.g., rms) that might support the pooling of MI data?


Solution

  • I don't think mice support pooling the results from SUR. You have to pool the results manually using Rubin's rules. I can go up to a certain point using your example, perhaps you can take it from there.

    library(systemfit)
    library(mice)
    nhanes2
    
    # add two imputation as example
    imp <- mice(nhanes2, m = 2)
    m=2
    
    # create a data set with all the 
    # complete imputed data sets
    df<-mice::complete(imp, action="long", include=FALSE)
    
    #create separate data frames for each mi
    for(i in (df$.imp)) {
      nam <- paste0("df_", i)
      assign(nam, df[df$.imp==i,])
    }
    
    # Store the coefficients and se of each 
    # sur in the M imputed data sets
    
    M <-2 # number of imputations
    M2 <- M*2 #number of total sur regressions
    results <- as.data.frame(matrix(NA, nrow=M2, ncol = 4)) # will store here coefficients and se
    colnames(results)<-c("coef_r1", "coef_r2", "se_r1", "se_r2")
    
    # perform sur 
    r1 <- bmi ~ hyp 
    r2 <- bmi ~ age
    system <- list( r1, r2 )
    
    # start with first data set
    fitsur1 <- systemfit(list( r1= r1, r2 = r2),
                    data=df_1)
    # start with second data set
    fitsur2 <- systemfit(list( r1= r1, r2 = r2),
                     data=df_2)
    
    # this can be warped in a loop
    # but I could not do it...
    
    # Extract coefficients
    # Note I extract the coefficient 
    # from only one age-group of r2, 
    # Use same approach for the other
    # extract coef from fitsur1
    a <- as.data.frame(summary(fitsur1 )$coefficients[2,1])
    colnames(a)<-c("coef_r1")
    b <- as.data.frame(summary(fitsur1 )$coefficients[4,1])
    colnames(b)<-c("coef_r2")
    ab <- cbind(a,b)
    
    # extract coef from fitsur2
    c <- as.data.frame(summary(fitsur2 )$coefficients[2,1])
    colnames(c)<-c("coef_r1")
    d <- as.data.frame(summary(fitsur2 )$coefficients[4,1])
    colnames(d)<-c("coef_r2")
    cd <- cbind(c,d)
    
    
    # Follow same approach to extract SE
    # I cannot manage to extract them from 
    # the 'fitsur' list ...
    
    # merge extracted coef and se
    results <- rbind(ab, cd)
    
    # Then bind the coefficients and se
    # from all imputed regressions
    
    # Calculate the mean of pooled coefficients
    pooled.coef_r1 <- mean(results$coef_r1)
    pooled.coef_r2 <- mean(results$coef_r2)
    

    Calculating the pooled SE is more complex I used this post https://stats.stackexchange.com/questions/327237/calculating-pooled-p-values-manually

    # example for se_r1
    (betweenVar <- mean(results[,3])) # mean of variances
    (withinVar <- sd(results[,1])^2) # variance of variances
    (dfCorrection <- (nrow(results)+1)/(nrow(results))) # dfCorrection
    
    (totVar <- betweenVar + withinVar*dfCorrection) # total variance
    (pooledSE <- sqrt(totVar)) # standard error
    

    I have not looked into p values yet but should be easier now