Search code examples
rggplot2statisticsdata-analysisadjustment

How do I easily plot data in R (line plot), that were adjusted for confounding variables?


since I now understand how to line plot data with multiple timepoints in R using ggpubr package, I want to plot data that has been adjusted for some known confounding variables in the data set.

What I already tried is to calculate a linear model: outcome ~ confounder and plotted the residuals. However, when I do this (due to the nature of residuals) the values fluctuate around 0 and I lose much of the information regarding the original values. Is there any way, I can plot the outcome variable adjusted for one or more confounding variables where I can still grasp the original dimension of the data (e.g. I thought about subtracting the residuals from the original values).

I have a dataset about metabolic research which contains concentrations of blood plasma metabolites (glucose, amino acids etc) in the setting of a glucose tolerance test. I want to show that people with hepatic steatosis have higher metabolite concentrations than people without.

Here is what I already did. Please note, that I first have to exclude missing values from the dataset, so that the join works after adjustment. In the end I want to plot glycine in groups of steatosis (yes/no) adjusted for BMI.

Minimum dataset brought to you by chat GPT :D

# Creating a modified dataset
set.seed(123)

# Sample data with increasing Glycine values
main_study_long_T2D <- data.frame(
  BMI = rep(rnorm(10, mean = 25, sd = 5), 5),
  Glycine = rep(seq(50, 80, length.out = 10), 5) + ifelse(rep(c("yes", "no"), each = 5) == "yes", 10, 0) + rnorm(50, sd = 5),
  steatosis = rep(c("yes", "no"), each = 25),
  MMT_TIMEPOINT = rep(c(0, 30, 60, 120, 180), times = 10)
)
## Linear model to adjust Glycine for BMI
model_glycine <- lm(Glycine ~ BMI, data = main_study_long_T2D)
 
## Check
sum(!is.na(main_study_long_T2D$BMI))
sum(!is.na(main_study_long_T2D$Glycine)) ## Some patients / timepoints are missing in my original data
 
which(!is.na(main_study_long_T2D$Glycine))
 
## Calculate residuals
residuals_glycine <- residuals(model_glycine)
residuals_glycine
dim(residuals_glycine)
dim(as.data.frame(residuals_glycine))
 
## Patients / timepoints without missing glycine
identifiers <- main_study_long_T2D[!is.na(main_study_long_T2D$Glycine), c("PATNO", "MMT_TIMEPOINT")]
dim(identifiers)
length(residuals_glycine)
 
## Combine
residuals_data <- cbind(identifiers, residuals_glycine)
residuals_data
dim(residuals_data)
 
## Extend the dataset with residuals
main_study_long_T2D <- left_join(main_study_long_T2D, residuals_data,  by = c("PATNO", "MMT_TIMEPOINT"))
 
## ggline plot of residuals
main_study_long_T2D %>%
  filter(!(MMT_TIMEPOINT %in% c(-10, 90)) &
           DMTYP %in% c(0,2)) %>%
  mutate(DMTYP = as.factor(DMTYP)) %>%
  ggline(x="MMT_TIMEPOINT",
         y="residuals_glycine",  ## Use the calculated residuals
         color="steatosis",
         add=c("mean_se", "jitter"),
         add.params = list(alpha = 0.3),
         palette="jco") +
  stat_compare_means(aes(group = steatosis), label = "p.signif", method="t.test")

Solution

  • We can remove the effect of BMI by first regressing Glycine on BMI and steatosis.

    library(tidyverse)
    library(ggpubr)
    
    mod <- lm(Glycine ~ BMI + steatosis, data = main_study_long_T2D)
    

    This gives us a coefficient for the effect of BMI on Glycine. We then select which BMI we want to choose as the reference value for the purposes of plotting. Typically, this is the mean of BMI values in the study, though if you wanted you could set it to, say, 25.The difference between a BMI and the reference value multiplied by the coefficient is the amount that we add to the Glycine level to estimate a corrected Glycine level at the reference value:

    plot_df <- main_study_long_T2D %>%
      mutate(dif = coef(mod)[2] * (mean(BMI) - BMI),
             Glycine_corrected = Glycine + dif)
    

    Now we can use ggline to see a plot of corrected Glycine levels. We will compare this to uncorrected values. As you can see, the plots are similar but not identical:

    p1 <- ggline(plot_df,
           x = "MMT_TIMEPOINT",
           y = "Glycine",
           color = "steatosis",
           add = c("mean_se", "jitter"),
           add.params = list(alpha = 0.3),
           palette = "jco") +
      stat_compare_means(aes(group = steatosis), 
                         label = "p.signif", method = "t.test") +
      ylim(50, 90) +
      ggtitle("Uncorrected")
    
    p2 <- ggline(plot_df,
                 x = "MMT_TIMEPOINT",
                 y = "Glycine_corrected",
                 color = "steatosis",
                 add = c("mean_se", "jitter"),
                 add.params = list(alpha = 0.3),
                 palette = "jco") +
      stat_compare_means(aes(group = steatosis), 
                         label = "p.signif", method = "t.test") +
      ylim(50, 90) +
      ggtitle("Corrected for BMI")
    
    library(patchwork)
    
    p1 + p2
    

    enter image description here