Search code examples
rregressionspliner-mice

r lrm from rms package and imputed data from mice


I apologize for my ignorance, I’m jumping into an analysis midway and would appreciate guidance on the appropriate functions to use for estimating odds ratios (ORs).

Here’s the situation: I’ve been provided with a dataset that has already been imputed using the mice package. The original dataset with missing values is not available, and the imputed dataset appears to have been generated using mice::complete(imputed_df, "long", include = FALSE). This is my starting point.

Here's the code to reproduce this dataset

library(mice)
library(rms)
library(parameters)
library(splines)

set.seed(123) # For reproducibility
# Number of observations
n <- 100

# outcome
y <- rbinom(n, size = 1, prob = 0.07)

# bmi
bmi <- runif(n, min = 16, max = 60)
bmi[sample(1:n, size = round(0.12 * n))] <- NA # Introduce missing values

# tiktok_ban
tiktok_ban <- sample(1:90, size = n, replace = TRUE)
#tiktok_ban[sample(1:n, size = round(0.05 * n))] <- 0 # Force some values to be 0

# sex
child_sex <- sample(c("Male", "Female"), size = n, replace = TRUE, prob = c(0.49, 0.51))
child_sex[sample(1:n, size = round(0.05 * n))] <- NA # Introduce missing values

# Combine into a data frame
df <- data.frame(y, bmi, tiktok_ban, child_sex)

# View first rows of the dataset
head(df)

#

imputed_df <- mice(df, print = FALSE, m = 20, seed = 24415, method = "pmm", printFlag = FALSE)

imputed_df_l <- mice::complete(imputed_df, "long", include = FALSE)


The goal is to estimate the Odds Ratio of an event (y).

I am modelling tiktok_ban as nonlinear spline with jump at day 20

d    <- imputed_df_l
dd   <- datadist(d);  options(datadist='dd')


Hmisc::describe(imputed_df_l$tiktok_ban)

k  <- attr(rcs(imputed_df_l$tiktok_ban, 6), 'parms')
k

h <- function(x) {
  z <- cbind(rcspline.eval(x, k),
             jump=x >= 20)
  attr(z, 'nonlinear') <- 2 : ncol(z)
  z
}

I am able to estimate the OR for this model on individual imputed dataset

f <- lrm(y ~ child_sex + bmi +
             gTrans(tiktok_ban, h),
             data= subset(imputed_df_l, .imp == 1)) 

summary(f)

Please see results here

How do I scale this code to run on multiple imputed datasets and generate pooled Odds Ratios and create a plot of Odds Ratios versus tiktok_ban based on the final pooled results?

Please note I cannot change the imputed datasets imputed_df_l , this is what I was given. Thanks in advance for any help

I have tried fit.mult.impute but ran into errors and unable to make it work for my usecase.


Solution

  • I've done something similar there. It follows Rubin's logic1, 2.

    > PF <- by(imputed_df_l, ~ .imp, \(x) {
    +   summary(lrm(y ~ child_sex + bmi + gTrans(tiktok_ban, h), data=x))
    + }) |> 
    +   simplify2array()
    > 
    > m. <- length(unique(imputed_df_l$.imp))
    > Q <- rowMeans(PF[, 'Effect', ])  ## calculate mean estimates
    > U <- rowMeans(PF[, 'S.E.', ])  ## calculate within variances
    > B <- rowSums(((PF[, 'Effect', ] - Q)^2))/(m. - 1)  ## calculate between variances 
    > T <- U + (1 + 1/m.)*B  ## calculate total variances 
    > cbind(Estimate=Q, 'Std. Error'=sqrt(T))
                                 Estimate Std. Error
    bmi                        -0.2392742  0.8981817
     Odds Ratio                 0.8175118         NA
    tiktok_ban                  9.2942707  2.8305722
     Odds Ratio             10985.6518883         NA
    child_sex - Male:Female     0.6758367  0.9786733
     Odds Ratio                 1.9660930         NA
    

    I leave the CIs and the plot to you.