Search code examples
rpredictionr-factor

How come releveling a factor variable gives wrong output (predictions table) in ggeffects::ggemmeans()?


I use ggeffects::ggemmeans() to get predictions from models, and I don't know whether I found a bug or otherwise doing things wrong. When using a factor variable as a predictor in the model, the output of ggemmeans() gets messed up when releveling the factor.

Example

Below there are two scenarios, a and b, in which I convert a data column to a factor, then fit a model with lm() and finally calculate predictions with ggemmeans().

library(ggplot2)
library(dplyr)
library(emmeans)
library(ggeffects)

# scenario a
# step a1 -- convert manufacturer col to factor
my_mpg_manuf_as_fac_a <- 
  mpg %>%
  mutate(across(manufacturer, factor))

levels(my_mpg_manuf_as_fac_a$manufacturer) ## the levels are ordered alphabetically

#>  [1] "audi"       "chevrolet"  "dodge"      "ford"       "honda"     
#>  [6] "hyundai"    "jeep"       "land rover" "lincoln"    "mercury"   
#> [11] "nissan"     "pontiac"    "subaru"     "toyota"     "volkswagen"

# step a2 -- model and get predictions
pred_a <-
  my_mpg_manuf_as_fac_a %>%
  lm(cty ~ manufacturer, data = .) %>%
  ggemmeans(terms = "manufacturer")

pred_a

#> # Predicted values of cty
#> # x = manufacturer
#> 
#> x          | Predicted |         95% CI
#> ---------------------------------------
#> audi       |     17.61 | [16.25, 18.97]
#> dodge      |     13.14 | [12.19, 14.08]
#> ford       |     14.00 | [12.85, 15.15]
#> hyundai    |     18.64 | [17.10, 20.18]
#> land rover |     11.50 | [ 8.62, 14.38]
#> mercury    |     13.25 | [10.37, 16.13]
#> pontiac    |     17.00 | [14.42, 19.58]
#> volkswagen |     20.93 | [19.82, 22.04]


# scenario b
# step b1 -- convert manufacturer col to factor (same as step a1)
my_mpg_manuf_as_fac_b <- 
  mpg %>%
  mutate(across(manufacturer, factor))

# step b2 -- change the order of levels in manufacturer
levels(my_mpg_manuf_as_fac_b$manufacturer) <- sort(levels(my_mpg_manuf_as_fac_b$manufacturer), decreasing = TRUE)

levels(my_mpg_manuf_as_fac_b$manufacturer) ## order of levels is now reveresed

#>  [1] "volkswagen" "toyota"     "subaru"     "pontiac"    "nissan"    
#>  [6] "mercury"    "lincoln"    "land rover" "jeep"       "hyundai"   
#> [11] "honda"      "ford"       "dodge"      "chevrolet"  "audi"

# step b3 -- model and get predictions
pred_b <-
  my_mpg_manuf_as_fac_b %>%
  lm(cty ~ manufacturer, data = .) %>%
  ggemmeans(terms = "manufacturer")

pred_b

#> # Predicted values of cty
#> # x = manufacturer
#> 
#> x          | Predicted |         95% CI
#> ---------------------------------------
#> volkswagen |     17.61 | [16.25, 18.97]
#> subaru     |     13.14 | [12.19, 14.08]
#> pontiac    |     14.00 | [12.85, 15.15]
#> mercury    |     18.64 | [17.10, 20.18]
#> land rover |     11.50 | [ 8.62, 14.38]
#> hyundai    |     13.25 | [10.37, 16.13]
#> ford       |     17.00 | [14.42, 19.58]
#> audi       |     20.93 | [19.82, 22.04]

Created on 2021-05-03 by the reprex package (v0.3.0)

When we compare pred_a and pred_b it's easy to see that the values in Predicted and 95% CI columns stay the same, even though the order of names in x column has changed.

pred_a

## # Predicted values of cty
## # x = manufacturer

## x          | Predicted |         95% CI
## ---------------------------------------
## audi       |     17.61 | [16.25, 18.97]
## dodge      |     13.14 | [12.19, 14.08]
## ford       |     14.00 | [12.85, 15.15]
## hyundai    |     18.64 | [17.10, 20.18]
## land rover |     11.50 | [ 8.62, 14.38]
## mercury    |     13.25 | [10.37, 16.13]
## pontiac    |     17.00 | [14.42, 19.58]
## volkswagen |     20.93 | [19.82, 22.04]


pred_b

## # Predicted values of cty
## # x = manufacturer

## x          | Predicted |         95% CI
## ---------------------------------------
## volkswagen |     17.61 | [16.25, 18.97]
## subaru     |     13.14 | [12.19, 14.08]
## pontiac    |     14.00 | [12.85, 15.15]
## mercury    |     18.64 | [17.10, 20.18]
## land rover |     11.50 | [ 8.62, 14.38]
## hyundai    |     13.25 | [10.37, 16.13]
## ford       |     17.00 | [14.42, 19.58]
## audi       |     20.93 | [19.82, 22.04]

Is this a bug or am I doing something wrong?


Solution

  • You should instead use the factor() function to relevel, because levels() doesn't really see the underlying data. When you use levels(), your entire data changes: audi becomes volkswagen, etc. But by passing the original vector to factor() you are preserving the values themselves.

    Data:

    manufacturers=c("audi","chevrolet","subaru","toyota","volkswagen")
    df = data.frame(mpg = runif(length(manufacturers)*2, 30, 50), manufacturer = rep(manufacturers, each = 2), stringsAsFactors = TRUE)
    

    Before:

    > df$manufacturer
    [1] audi       audi       chevrolet  chevrolet  subaru     subaru     toyota     toyota     volkswagen volkswagen
    Levels: audi chevrolet subaru toyota volkswagen
    

    After:

    df$manufacturer = factor(df$manufacturer, levels = sort(levels(df$manufacturer),decreasing = T))
    > df$manufacturer
    [1] audi       audi       chevrolet  chevrolet  subaru     subaru     toyota     toyota     volkswagen volkswagen
    Levels: volkswagen toyota subaru chevrolet audi
    

    Compare this to:

    df = data.frame(mpg = runif(length(manufacturers)*2, 30, 50), manufacturer = rep(manufacturers, each = 2), stringsAsFactors = TRUE)
    levels(df$manufacturer) = sort(levels(df$manufacturer),decreasing = T)
    
    > df$manufacturer
    [1] volkswagen volkswagen toyota     toyota     subaru     subaru     chevrolet  chevrolet  audi       audi      
    Levels: volkswagen toyota subaru chevrolet audi
    

    which renamed the entire vector.