I use ggeffects::ggemmeans()
to get predictions from models, and I don't know whether I found a bug or otherwise doing things wrong. When using a factor variable as a predictor in the model, the output of ggemmeans()
gets messed up when releveling the factor.
Below there are two scenarios, a
and b
, in which I convert a data column to a factor, then fit a model with lm()
and finally calculate predictions with ggemmeans()
.
library(ggplot2)
library(dplyr)
library(emmeans)
library(ggeffects)
# scenario a
# step a1 -- convert manufacturer col to factor
my_mpg_manuf_as_fac_a <-
mpg %>%
mutate(across(manufacturer, factor))
levels(my_mpg_manuf_as_fac_a$manufacturer) ## the levels are ordered alphabetically
#> [1] "audi" "chevrolet" "dodge" "ford" "honda"
#> [6] "hyundai" "jeep" "land rover" "lincoln" "mercury"
#> [11] "nissan" "pontiac" "subaru" "toyota" "volkswagen"
# step a2 -- model and get predictions
pred_a <-
my_mpg_manuf_as_fac_a %>%
lm(cty ~ manufacturer, data = .) %>%
ggemmeans(terms = "manufacturer")
pred_a
#> # Predicted values of cty
#> # x = manufacturer
#>
#> x | Predicted | 95% CI
#> ---------------------------------------
#> audi | 17.61 | [16.25, 18.97]
#> dodge | 13.14 | [12.19, 14.08]
#> ford | 14.00 | [12.85, 15.15]
#> hyundai | 18.64 | [17.10, 20.18]
#> land rover | 11.50 | [ 8.62, 14.38]
#> mercury | 13.25 | [10.37, 16.13]
#> pontiac | 17.00 | [14.42, 19.58]
#> volkswagen | 20.93 | [19.82, 22.04]
# scenario b
# step b1 -- convert manufacturer col to factor (same as step a1)
my_mpg_manuf_as_fac_b <-
mpg %>%
mutate(across(manufacturer, factor))
# step b2 -- change the order of levels in manufacturer
levels(my_mpg_manuf_as_fac_b$manufacturer) <- sort(levels(my_mpg_manuf_as_fac_b$manufacturer), decreasing = TRUE)
levels(my_mpg_manuf_as_fac_b$manufacturer) ## order of levels is now reveresed
#> [1] "volkswagen" "toyota" "subaru" "pontiac" "nissan"
#> [6] "mercury" "lincoln" "land rover" "jeep" "hyundai"
#> [11] "honda" "ford" "dodge" "chevrolet" "audi"
# step b3 -- model and get predictions
pred_b <-
my_mpg_manuf_as_fac_b %>%
lm(cty ~ manufacturer, data = .) %>%
ggemmeans(terms = "manufacturer")
pred_b
#> # Predicted values of cty
#> # x = manufacturer
#>
#> x | Predicted | 95% CI
#> ---------------------------------------
#> volkswagen | 17.61 | [16.25, 18.97]
#> subaru | 13.14 | [12.19, 14.08]
#> pontiac | 14.00 | [12.85, 15.15]
#> mercury | 18.64 | [17.10, 20.18]
#> land rover | 11.50 | [ 8.62, 14.38]
#> hyundai | 13.25 | [10.37, 16.13]
#> ford | 17.00 | [14.42, 19.58]
#> audi | 20.93 | [19.82, 22.04]
Created on 2021-05-03 by the reprex package (v0.3.0)
When we compare pred_a
and pred_b
it's easy to see that the values in Predicted
and 95% CI
columns stay the same, even though the order of names in x
column has changed.
pred_a
## # Predicted values of cty
## # x = manufacturer
## x | Predicted | 95% CI
## ---------------------------------------
## audi | 17.61 | [16.25, 18.97]
## dodge | 13.14 | [12.19, 14.08]
## ford | 14.00 | [12.85, 15.15]
## hyundai | 18.64 | [17.10, 20.18]
## land rover | 11.50 | [ 8.62, 14.38]
## mercury | 13.25 | [10.37, 16.13]
## pontiac | 17.00 | [14.42, 19.58]
## volkswagen | 20.93 | [19.82, 22.04]
pred_b
## # Predicted values of cty
## # x = manufacturer
## x | Predicted | 95% CI
## ---------------------------------------
## volkswagen | 17.61 | [16.25, 18.97]
## subaru | 13.14 | [12.19, 14.08]
## pontiac | 14.00 | [12.85, 15.15]
## mercury | 18.64 | [17.10, 20.18]
## land rover | 11.50 | [ 8.62, 14.38]
## hyundai | 13.25 | [10.37, 16.13]
## ford | 17.00 | [14.42, 19.58]
## audi | 20.93 | [19.82, 22.04]
Is this a bug or am I doing something wrong?
You should instead use the factor()
function to relevel, because levels()
doesn't really see the underlying data. When you use levels()
, your entire data changes: audi
becomes volkswagen
, etc. But by passing the original vector to factor()
you are preserving the values themselves.
Data:
manufacturers=c("audi","chevrolet","subaru","toyota","volkswagen")
df = data.frame(mpg = runif(length(manufacturers)*2, 30, 50), manufacturer = rep(manufacturers, each = 2), stringsAsFactors = TRUE)
Before:
> df$manufacturer
[1] audi audi chevrolet chevrolet subaru subaru toyota toyota volkswagen volkswagen
Levels: audi chevrolet subaru toyota volkswagen
After:
df$manufacturer = factor(df$manufacturer, levels = sort(levels(df$manufacturer),decreasing = T))
> df$manufacturer
[1] audi audi chevrolet chevrolet subaru subaru toyota toyota volkswagen volkswagen
Levels: volkswagen toyota subaru chevrolet audi
Compare this to:
df = data.frame(mpg = runif(length(manufacturers)*2, 30, 50), manufacturer = rep(manufacturers, each = 2), stringsAsFactors = TRUE)
levels(df$manufacturer) = sort(levels(df$manufacturer),decreasing = T)
> df$manufacturer
[1] volkswagen volkswagen toyota toyota subaru subaru chevrolet chevrolet audi audi
Levels: volkswagen toyota subaru chevrolet audi
which renamed the entire vector.