Search code examples
rpredictiondata-transform

R ggpredict log transformed predictor


My apologies if this is already discussed somewhere, I am mainly interested to know how to use ggpredict() when one of my covariate/predictor variable is log transformed. For example if this is my model.

library(magrittr)
library(ggeffects)
library(sjmisc)
library(lme4)
library(splines)

Create dataset

set.seed(123)

dat <- data.frame(
  outcome = rbinom(n = 100, size = 1, prob = 0.35),
  var_binom = as.factor(rbinom(n = 100, size = 1, prob = 0.2)),
  var_cont = rnorm(n = 100, mean = 10, sd = 7),
  group = sample(letters[1:4], size = 100, replace = TRUE)
)

dat$var_cont <- sjmisc::std(dat$var_cont)

logistic regression model.

Please note here my predictor var_cont is log transformed.

m1 <- glmer(
  outcome ~ var_binom + log(var_cont) + (1 | group), 
  data = dat, 
  family = binomial(link = "logit")
)

GG predict on log transformed variable ?

 ggpredict(m1, "var_cont" ????)

Solution

  • It depends a bit on exactly what you want. Here are a few possibilities:

    library(ggeffects)
    library(lme4)
    #> Loading required package: Matrix
    
    set.seed(123)
    
    dat <- data.frame(
      outcome = rbinom(n = 100, size = 1, prob = 0.35),
      var_binom = as.factor(rbinom(n = 100, size = 1, prob = 0.2)),
      var_cont = abs(rnorm(n = 100, mean = 10, sd = 7)),
      group = sample(letters[1:4], size = 100, replace = TRUE)
    )
    
    m1 <- glmer(
      outcome ~ var_binom + log(var_cont) + (1 | group), 
      data = dat, 
      family = binomial(link = "logit"))
    

    Just doing ggpredict(m1, "var_cont") will give you predicted values of y for the original values of x (not the logged values, ggpredict() does the transformation for you automatically).

    ggpredict(m1, "var_cont")
    #> Data were 'prettified'. Consider using `terms="var_cont [all]"` to get
    #>   smooth plots.
    #> # Predicted probabilities of outcome
    #> 
    #> var_cont | Predicted |       95% CI
    #> -----------------------------------
    #>        0 |      0.00 |             
    #>        5 |      0.36 | [0.24, 0.50]
    #>       10 |      0.38 | [0.25, 0.52]
    #>       15 |      0.39 | [0.25, 0.55]
    #>       20 |      0.40 | [0.25, 0.57]
    #>       25 |      0.40 | [0.24, 0.59]
    #>       30 |      0.41 | [0.23, 0.61]
    #>       35 |      0.41 | [0.23, 0.62]
    #> 
    #> Adjusted for:
    #> * var_binom = 0
    #> *     group = 0 (population-level)
    ggpredict(m1, "var_cont", type="re")
    

    The code above does not incorporate the random effect at all, as it uses type="fe" by default. You could include the random effect variance in the calculation of the confidence intervals, which would be prediction intervals if the random effect variance is used. This doesn't change the predictions themselves, only the variance.

    #> Data were 'prettified'. Consider using `terms="var_cont [all]"` to get
    #>   smooth plots.
    #> # Predicted probabilities of outcome
    #> 
    #> var_cont | Predicted |       95% CI
    #> -----------------------------------
    #>        0 |      0.00 |             
    #>        5 |      0.36 | [0.07, 0.81]
    #>       10 |      0.38 | [0.07, 0.83]
    #>       15 |      0.39 | [0.08, 0.83]
    #>       20 |      0.40 | [0.08, 0.84]
    #>       25 |      0.40 | [0.08, 0.85]
    #>       30 |      0.41 | [0.08, 0.85]
    #>       35 |      0.41 | [0.08, 0.86]
    #> 
    #> Adjusted for:
    #> * var_binom = 0
    #> *     group = 0 (population-level)
    #> 
    #> Intervals are prediction intervals.
    

    Finally, if you wanted to include the estimate of the random intercept in the calculation of the prediction, you can add your grouping variable "group" into the terms being considered and use type="re". This gives prediction rather than confidence intervals, but gives a different set of predictions for each group using the group-level intercepts.

    ggpredict(m1, c("var_cont","group"), type="re")
    #> Data were 'prettified'. Consider using `terms="var_cont [all]"` to get
    #>   smooth plots.
    #> # Predicted probabilities of outcome
    #> 
    #> # group = a
    #> 
    #> var_cont | Predicted |       95% CI
    #> -----------------------------------
    #>        0 |      0.00 |             
    #>        5 |      0.32 | [0.06, 0.78]
    #>       15 |      0.35 | [0.06, 0.81]
    #>       20 |      0.35 | [0.06, 0.82]
    #>       35 |      0.37 | [0.06, 0.83]
    #> 
    #> # group = b
    #> 
    #> var_cont | Predicted |       95% CI
    #> -----------------------------------
    #>        0 |      0.00 |             
    #>        5 |      0.31 | [0.06, 0.78]
    #>       15 |      0.34 | [0.06, 0.80]
    #>       20 |      0.35 | [0.06, 0.81]
    #>       35 |      0.36 | [0.06, 0.83]
    #> 
    #> # group = c
    #> 
    #> var_cont | Predicted |       95% CI
    #> -----------------------------------
    #>        0 |      0.00 |             
    #>        5 |      0.40 | [0.08, 0.84]
    #>       15 |      0.43 | [0.09, 0.85]
    #>       20 |      0.44 | [0.09, 0.86]
    #>       35 |      0.45 | [0.09, 0.87]
    #> 
    #> # group = d
    #> 
    #> var_cont | Predicted |       95% CI
    #> -----------------------------------
    #>        0 |      0.00 |             
    #>        5 |      0.42 | [0.09, 0.85]
    #>       15 |      0.45 | [0.10, 0.87]
    #>       20 |      0.46 | [0.10, 0.87]
    #>       35 |      0.48 | [0.10, 0.89]
    #> 
    #> Adjusted for:
    #> * var_binom = 0
    #> 
    #> Intervals are prediction intervals.
    

    Created on 2023-03-21 with reprex v2.0.2

    In all these cases, the predictions are made for the original values of x, rather than the logged values. If you wanted the plot on the log scale, you could add scale_x_log10() to the plot

    One other piece of advice, I noticed that you had dat$var_cont <- sjmisc::std(dat$var_cont) in your code (which I removed). This will cause all sorts of problems if you go on to log var_cont as roughly half of the observations will be non-positive and thus turned to NA by the log transformation. In general, your observations should be strictly positive for the log transformation. If your variable includes zero, an inverse hyperbolic sine transformation asinh() in R, would work. If your variables include negative values, too, and require transformation then try the powerTransform() function from the car package with family family="bcnPower".