Search code examples
rplotregressionpredictmarginal-effects

Plot_cap response curve for counterfactual data


The following code was made up to replicate my problem with a bigger, more complex data set.

library(marginaleffects)
library(truncnorm)

yield_kgha<-rtruncnorm(n=100, mean=2000, sd=150)
n_kgha<-rtruncnorm(n=100, a=40, b=298, mean=150, sd=40)

i<-lm(yield_kgha~n_kgha+I(n_kgha^2))
summary(i)

I have used the predictions function from the marginaleffects package to see the predicted yields(yield_kgha) for a range of nitrogen rates (n_kgha) from my regression model (i). My original data only has n_kgha rates ranging from approximately 40-250, so the below code allows me to see predicted yields at n_kgha rates not actually in my data set.

p <- predictions(
  i,
  newdata = datagrid(model = i, n_kgha = seq(0, 300, by = 25), grid_type = "counterfactual"))
summary(p, by = "n_kgha")

I would like to plot the response of yield conditioned on n_kgha ranging from 0-300, which includes n_kgha values not in my original data set. I have tried to do this using the plot_cap function.

plot_cap(i, condition = "n_kgha")

However, since my original data only has n_kgha rates ranging from 40-250 I am not getting my desired result of seeing the response curve over the n_kgha (0:300) range. When I plot using the plot_cap function I get the following response curve with n_kgha ranging from 40-250 (the max and min of the original data set).

enter image description here

Is there a way to run the plot_cap function based on the counterfactual range of n_kgha as used in the predictions function? Or should I use another method to plot the predicted values based on counterfactual values?


Solution

  • The plot_cap() function only plots over the observed range of values. However, since predictions() returns a “tidy” data frame, it is trivial to use the the output of this function with ggplot2 to get the plot you need.

    Note that we do not need to specify grid.type="counterfactual". This is option will do something very weird and specific: duplicate the whole dataset many times for each value of the user-supplied values. It is only useful in very specific corner-cases, and not when you just want to make predictions over unobserved values of the predictors. See the documentation with ?datagrid.

    Here’s a simple example of predictions() with ggplot2 to achieve what you want:

    library(marginaleffects)
    library(truncnorm)
    library(ggplot2)
    
    yield_kgha <- rtruncnorm(n = 100, mean = 2000, sd = 150)
    n_kgha <- rtruncnorm(n = 100, a = 40, b = 298, mean = 150, sd = 40)
    
    i <- lm(yield_kgha ~ n_kgha + I(n_kgha^2))
    
    p <- predictions(i, newdata = datagrid(n_kgha = seq(0, 300, 10)))
    
    ggplot(p, aes(n_kgha, predicted)) +
        geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .1) +
        geom_line() +
        theme_minimal()