Search code examples
remmeansmassr-marginaleffectsggeffects

Different plots of marginal effects with interaction term and offset when using different packages


I'm trying to understand why when I plot marginal effects using different packages I get different plots. Admittedly, I suspect it's because I do not understand how data transformations and offsets are being handled by the respective packages. I'm afraid I don't know how to make a reproducible sample of data, but I hope I can provide enough context below.

I have the following model using package MASS:

model <- glm(Lethal_removal ~ Federal_land + state / pred_rate + offset(log(subtotal)), data=df3, family=negative.binomial(1), na.action=na.fail)

The unit of analysis is state counties. State is a factor with 4 US states, pred_rate is an annual measure of how many livestock are eaten per county by wild predators (mean of 3.5 per year), lethal_removal is the number of wild predators that are culled per county by wildlife agencies (mean of 1.6 per year). Offset is number of years (subtotal) that predators are present in each county. I want to plot the marginal effects of pred_rate on lethal_removal by state while accounting for the subtotal offset (i.e., convert to an annual rate).

I used package emmeans first:

plot_em<-emmip(model, state ~ pred_rate, cov.reduce = range, offset = log(1), CIs = TRUE, plotit=FALSE)

#and plot it with the observed data    
    ggplot(as.data.frame(plot_em), aes(x=pred_rate, y=yvar, color=state)) + 
      geom_line() + geom_ribbon(aes(ymax=UCL, ymin=LCL, fill=state), alpha=0.4) + 
      geom_point(data = df3, aes(y=Lethal_removal/subtotal), size = 3)

Which produces:enter image description here

I then try ggeffects which gives me:

pr <- predict_response(model, c("pred_rate", "state"), condition = c(subtotal = 1))
plot(pr, show_data = TRUE)

Giving me this: enter image description here

Which suggests it was on an exponential scale, so I log-transformed the y-axis plot(pr, show_data = TRUE, log_y = TRUE) which gave me this: enter image description here

Which is very similar to the results from emmeans but the scale on the y-axis is way off. Frankly, the plot from emmeans makes intuitive sense, but my inability to easily recreate it leaves me concerned that I'm missing something and I'm not sure what is the correct marginal effects plot. Any advice huge appreciated.

Below is dput of the data.frame df3

structure(list(pred_rate = c(0.72, 11.6092307692308, 0.669090909090909, 
2.67636363636364, 18.9753846153846, 0, 0, 5.50153846153846, 7.6, 
0.0707692307692308, 5.2, 0.283076923076923, 0.306666666666667, 
2.47666666666667, 4.03076923076923, 9.83692307692308, 2.88727272727273, 
0, 0, 4.28923076923077, 1.28, 0.752727272727273, 17.7569230769231, 
0.849230769230769, 0, 0, 0, 6.65230769230769, 0.153333333333333, 
6.532, 2.15692307692308, 4.06153846153846, 0.750769230769231, 
0.167272727272727, 4.27692307692308, 16.7309090909091, 0, 1.37230769230769, 
0.345, 8.6, 2.06666666666667, 6.76, 1.6, 0, 0.353846153846154, 
0.501818181818182, 3.37333333333333, 0.0307692307692308, 3.36307692307692, 
0.131428571428571, 0, 0.153333333333333, 13.8615384615385, 10.9692307692308, 
2.52, 0.265, 19.3476923076923, 0.44, 0.141538461538462, 1.51384615384615, 
0.184, 0.283076923076923, 4.48, 1.4, 0.306666666666667, 7.65846153846154, 
10.3507692307692, 2.32, 2.92923076923077, 0, 0.849230769230769, 
0, 0.6, 7.02181818181818, 1.49538461538462, 1.37538461538462, 
0.683076923076923, 4.4, 2.65090909090909, 1.244, 26.9507692307692, 
0, 5.24307692307692, 0, 3.10769230769231, 0, 0.581538461538462, 
0.4), state = c("IDAHO", "IDAHO", "WASHINGTON", "OREGON", "MONTANA", 
"IDAHO", "MONTANA", "IDAHO", "IDAHO", "IDAHO", "IDAHO", "IDAHO", 
"MONTANA", "IDAHO", "IDAHO", "MONTANA", "MONTANA", "WASHINGTON", 
"OREGON", "IDAHO", "IDAHO", "WASHINGTON", "IDAHO", "MONTANA", 
"OREGON", "OREGON", "WASHINGTON", "IDAHO", "MONTANA", "WASHINGTON", 
"MONTANA", "IDAHO", "MONTANA", "WASHINGTON", "IDAHO", "MONTANA", 
"MONTANA", "MONTANA", "OREGON", "IDAHO", "OREGON", "IDAHO", "MONTANA", 
"OREGON", "MONTANA", "WASHINGTON", "OREGON", "IDAHO", "MONTANA", 
"OREGON", "OREGON", "IDAHO", "IDAHO", "MONTANA", "MONTANA", "IDAHO", 
"MONTANA", "MONTANA", "MONTANA", "MONTANA", "OREGON", "WASHINGTON", 
"MONTANA", "IDAHO", "WASHINGTON", "MONTANA", "MONTANA", "MONTANA", 
"MONTANA", "IDAHO", "MONTANA", "WASHINGTON", "WASHINGTON", "WASHINGTON", 
"MONTANA", "MONTANA", "IDAHO", "MONTANA", "OREGON", "OREGON", 
"IDAHO", "WASHINGTON", "OREGON", "OREGON", "IDAHO", "WASHINGTON", 
"MONTANA", "WASHINGTON"), Federal_land = c(42.534151492934, 63.9784662986647, 
16.6003160782432, 51.1028955500689, 58.947487328762, 9.06180914439123, 
0.911130670014708, 76.9081756955196, 73.1858648432361, 39.5200226761674, 
48.8486328184087, 60.7062703076754, 35.0609330068159, 63.2475807326415, 
64.3067163139122, 43.5861065699575, 12.4708493770657, 79.1564164555182, 
51.2009118836618, 61.6880505872881, 50.5134498548557, 28.5596173825025, 
92.9698127308114, 45.4524774304155, 75.2608706829412, 50.0736659471988, 
4.57522270679354, 69.066464511745, 17.0816340292832, 33.9771447677684, 
72.3480726901471, 58.9285919495093, 43.6772079658238, 20.7664717977071, 
36.9601044858487, 20.7302226596209, 4.31711620704182, 63.9399492882534, 
61.074079782981, 83.423569139067, 51.016535147847, 29.5419594948999, 
52.2630701694041, 27.6580391015359, 25.9703952858087, 34.142779076987, 
4.0917069e-07, 30.2891363697992, 17.0896884679476, 73.1092996462389, 
56.4327951729983, 15.8206520127824, 90.3193622408296, 48.5392976606421, 
74.4951345121614, 19.7985864424376, 46.6355884021095, 31.9051614956376, 
81.6803761570734, 54.5946431095638, 11.4225705007466, 46.0670532397187, 
53.2256541508368, 24.6637518111356, 58.0002021332745, 10.412284283733, 
50.4107537544319, 73.7125077034306, 51.7314184202192, 74.9360369140056, 
51.1361611998182, 40.779585438193, 1.87434204580111, 19.5796225789176, 
17.6532173966671, 24.695272565077, 33.809439417054, 17.4984242308043, 
20.8413651931395, 47.821008360042, 86.0044585359916, 2.10073799490256, 
58.2991711104475, 16.7330858857592, 36.7384554719021, 54.4145335580903, 
7.26384475698267, 0.659296220775652), Lethal_removal = c(3L, 
67L, 1L, 12L, 195L, 0L, 0L, 43L, 67L, 1L, 18L, 0L, 1L, 11L, 35L, 
8L, 3L, 0L, 0L, 16L, 116L, 2L, 100L, 5L, 0L, 0L, 0L, 61L, 0L, 
19L, 22L, 19L, 0L, 0L, 10L, 61L, 0L, 15L, 0L, 94L, 0L, 5L, 23L, 
0L, 1L, 0L, 0L, 4L, 57L, 0L, 0L, 4L, 84L, 136L, 43L, 9L, 99L, 
0L, 8L, 16L, 0L, 0L, 9L, 0L, 0L, 9L, 55L, 30L, 51L, 2L, 8L, 0L, 
0L, 14L, 5L, 0L, 2L, 7L, 2L, 0L, 86L, 0L, 11L, 0L, 10L, 0L, 2L, 
0L), subtotal = c(10L, 13L, 11L, 11L, 13L, 13L, 13L, 13L, 13L, 
13L, 11L, 13L, 12L, 12L, 13L, 13L, 11L, 10L, 3L, 13L, 13L, 11L, 
13L, 13L, 1L, 4L, 1L, 13L, 12L, 10L, 13L, 13L, 13L, 11L, 13L, 
11L, 12L, 13L, 8L, 13L, 9L, 2L, 13L, 2L, 13L, 11L, 9L, 13L, 13L, 
7L, 4L, 12L, 13L, 13L, 13L, 8L, 13L, 13L, 13L, 13L, 5L, 13L, 
13L, 1L, 12L, 13L, 13L, 13L, 13L, 13L, 13L, 5L, 1L, 11L, 13L, 
13L, 13L, 13L, 11L, 10L, 13L, 7L, 13L, 5L, 13L, 4L, 13L, 1L)), row.names = c(NA, 
-88L), class = "data.frame")

Solution

  • emmip by default returns predictions on the link scale - predict_response() always returns predictions on the response scale. That's why the axis limits are so different. If you use type = "response", you get the same, large axis range:

    emmeans::emmip(
      model,
      state ~ pred_rate,
      type = "response",
      cov.reduce = range,
      offset = log(1),
      CIs = TRUE,
      plotit = FALSE
    )
    

    So, the y-axis of the plot from predict_response() indicates average predicted counts / incidences, the y-axis from emmip() is by default the log of average predicted counts.

    Edits

    Some closer look at different ranges / subsets of the data. As you can see, it seems there are not many data points and probably some outliers for Oregon, which a) lead to spuriously high predicted values and b) large confidence intervals.

    library(MASS)
    library(ggeffects)
    
    model <- glm(Lethal_removal ~ Federal_land + state / pred_rate + offset(log(subtotal)), data = df3, family = negative.binomial(1), na.action = na.fail)
    
    # only three states, as Oregon has extreme values - furthermore, limit
    # the range of pred_rate to 0:7 - higher values have larger confidence intervals
    pr <- predict_response(model, terms = c("pred_rate [0:7]", "state [IDAHO, WASHINGTON, MONTANA]"))
    plot(pr, show_ci = FALSE)
    

    # include all states
    pr <- predict_response(model, terms = c("pred_rate [0:7]", "state"))
    plot(pr, show_ci = FALSE)
    

    # now add CI
    pr <- predict_response(model, terms = c("pred_rate [0:7]", "state [IDAHO, WASHINGTON, MONTANA]"))
    plot(pr, show_ci = TRUE)
    

    When you check model assumptions (e.g. using performance::check_model(), you see that you still seem to have some over dispersion, and that the data has more zeros than expected by your model. I assume your plots are not only a matter of log-scaled axis or not, but also about a) finding the right model and b) checking if you need to select a subset of your data so you have enough data values for prediction (or: limit the range of predictions for pred_rate).

    enter image description here