Search code examples
rpredict

What are the differences between ggpredict() and predictions()?


I wanted to generate some predicted probabilities and I was recommended two commands: ggeffects::ggpredict() and margin effects::predictions(). However, some of the results seem similar (the conf.high) and other results seem different (the conf.low). What is going on here? Any insights would be helpful.

Here's an example:

logit <- glm(vs ~ mpg + cyl + disp + hp + drat + wt + qsec + am + gear + carb, mtcars, family = "poisson")

library(ggeffects)
ggpredict(logit, terms=c("mpg")) %>% as_tibble()

library(marginaleffects)
predictions(logit, by=c("mpg")) %>% as_tibble()

Solution

  • The two functions do different things by default.

    1. ggpredict() with the terms argument computes predictions for each row of a data frame where mpg takes on values between 10 and 32, and all other variables are held at their means, including am, which is set to 0.41 even if it is presumably binary/categorical.
    2. predictions() with the by argument computes predictions for each row in the actually observed data frame, and then takes the average of these predictions for each unique value of mpg.

    In other words, ggpredict() reports predictions for “synthetic” data — hypothetical units which are exactly average along all dimensions except one. predictions() reports averages of predictions made on actually observed data.

    It is easy to reproduce the same results as ggpredict() with predictions(), by using the newdata argument and the datagrid() function:

    library(marginaleffects)
    library(ggeffects)
    logit <- glm(vs ~ mpg + cyl + disp + hp + drat + wt + qsec + am + gear + carb, mtcars, family = "poisson")
    
    ggpredict(logit,
      terms = c("mpg")) |>
      data.frame() |> head()
    #    x predicted std.error   conf.low conf.high group
    # 1 10 0.2243430 1.2888961 0.01793916 2.8055810     1
    # 2 12 0.2018065 1.0826924 0.02417384 1.6847077     1
    # 3 14 0.1815340 0.9117794 0.03039834 1.0840915     1
    # 4 16 0.1632979 0.7991309 0.03410037 0.7819915     1
    # 5 18 0.1468937 0.7707321 0.03243060 0.6653523     1
    # 6 20 0.1321375 0.8352217 0.02570893 0.6791536     1
    
    predictions(logit,
      newdata = datagrid(mpg = seq(10, 34, by = 2), am = mean)) |>
      head()
    # 
    #  mpg    am Estimate Pr(>|z|)   S  2.5 % 97.5 %  cyl disp  hp drat   wt qsec
    #   10 0.406    0.224   0.2462 2.0 0.0179  2.806 6.19  231 147  3.6 3.22 17.8
    #   12 0.406    0.202   0.1394 2.8 0.0242  1.685 6.19  231 147  3.6 3.22 17.8
    #   14 0.406    0.182   0.0613 4.0 0.0304  1.084 6.19  231 147  3.6 3.22 17.8
    #   16 0.406    0.163   0.0233 5.4 0.0341  0.782 6.19  231 147  3.6 3.22 17.8
    #   18 0.406    0.147   0.0128 6.3 0.0324  0.665 6.19  231 147  3.6 3.22 17.8
    #   20 0.406    0.132   0.0154 6.0 0.0257  0.679 6.19  231 147  3.6 3.22 17.8
    # 
    # Columns: rowid, estimate, p.value, s.value, conf.low, conf.high, vs, cyl, disp, hp, drat, wt, qsec, gear, carb, mpg, am 
    # Type:  invlink(link)
    

    PS: this is not a logit model.