Search code examples
rplotlogistic-regression

Plotting a multiple logistic regression for binary and continuous values in R


I have a data frame of mammal genera. Each row of the column is a different genus. There are three columns: a column of each genus's geographic range size (a continuous variable), a column stating whether or not a genus is found inside or outside of river basins (a binary variable), and a column stating whether the genus is found in the fossil record (a binary variable).

I have performed a multiple logistic regression to see if geographic range size and presence in/out of basins is a predictor of presence in the fossil record using the following R code.

Regression<-glm(df[ ,"FossilRecord"] ~ log(df[ ,"Geographic Range"]) + df[  ,"Basin"], family="binomial")

I am trying to find a way to visually summarize the output of this regression (other than a table of the regression summary).

I know how to do this for a single variable regression. For example, I could use a plot like this one if I wanted to see the relationship between just geographic range size and presence in the fossil record.

However, I do not know how to make a similar or equivalent plot when there are two independent variables, and one of them is binary. What are some plotting and data visualization techniques I could use in this case?

Thanks for the help!


Solution

  • Visualization is important and yet it can be very hard. With your example, I would recommend plotting one line for predicted FossilRecord versus GeographicRange for each level of your categorical covariate (Basin). Here's an example of how to do it with the ggplot2 package

    ##generating data
    ssize <- 100
    set.seed(12345)
    dat <- data.frame(
      Basin =  rbinom(ssize, 1,.4),
      GeographicRange = rnorm(ssize,10,2)
    )
    dat$FossilRecord = rbinom(ssize,1,(.3 + .1*dat$Basin + 0.04*dat$GeographicRange))
    
    ##fitting model
    fit <- glm(FossilRecord ~ Basin + GeographicRange, family=binomial(), data=dat)
    

    We can use the predict() function to obtain predicted response values for many GeographicRange values and for each Basin category.

    ##getting predicted response from model
    plotting_dfm <- expand.grid(GeographicRange = seq(from=0, to = 20, by=0.1),
                               Basin = (0:1))
    plotting_dfm$preds <- plogis( predict(fit , newdata=plotting_dfm))
    

    Now you can plot the predicted results:

    ##plotting the predicted response on the two covariates
    library(ggplot2)
    pl <- ggplot(plotting_dfm, aes(x=GeographicRange, y =preds, color=as.factor(Basin)))
    pl + 
      geom_point( ) +
      ggtitle("Predicted FossilRecord by GeoRange and Basin") + 
      ggplot2::ylab("Predicted FossilRecord")
    

    This will produce a figure like this: enter image description here