
ggpredict with categorical logistic regression in R


I have a data frame called "student" with 4 variables. I would like to perform multiple logistic regression with one binary dependent variable, "gender", which takes two categorical values (F/M, for Female and Male), and 3 independent variables (reading_score and math_score are continuous "double" columns, and lunch is a categorical "character" column).

To start the logistic regression, I converted the gender variable into a factor (it was not working otherwise). Then I used the glm() function with the "binomial" family, as below:

student$gender <- as.factor(student$gender)
glm.fit <- glm(gender~. , data = student, family = "binomial")

Now I would like to plot the model with the ggpredict() function. However, I keep getting the same error:

Error: Discrete value supplied to continuous scale

I tried to plot it using ggpredict as below:

ggpredict(glm.fit) %>% plot()

I have tried many tutorials and looked at many questions related to this topic, but I have not figured it out yet.

The purpose of using ggpredict for plotting is that I have 3 independent variables.
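(Side note: ggpredict() called without a terms argument computes adjusted predictions for every model term and returns them as a list, which can trip up plot() when continuous and categorical predictors are mixed. Naming the terms explicitly, as a minimal sketch assuming the glm.fit model above, usually avoids the scale error:)

```r
library(ggeffects)

# plot predictions for a single continuous predictor
ggpredict(glm.fit, terms = "math_score") |> plot()

# or group the curves by the categorical predictor
ggpredict(glm.fit, terms = c("math_score", "lunch")) |> plot()
```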

Note: A sample of the data is presented in the figure below.



Solution

  • There is the package sjPlot which does this very well. You may have a look here for further explanations and examples.

    As I wanted to show a plot which comes close to your data, I have created a new DF with gender as a binomial distributed dependent variable.

    library(tidyverse)
    library(sjPlot)
    
    set.seed(123) # set seed 
    
    # create 1000 reading and math scores randomly
    math_score <- sample(60:100, 1000, replace = TRUE)
    reading_score <- sample(60:100, 1000, replace = TRUE)
    # create lunch randomly from "standard" or "free"
    lunch <- sample(c('standard', 'free'), 1000, replace = TRUE)
    
    # make data frame
    student <- data.frame(
      math_score = math_score,
      reading_score = reading_score,
      lunch = as.factor(lunch)
    ) |> 
      # create a probability vector with some bias
      # bias is: if math_score is above its mean
      # prob is 0.8
      mutate(prob = ifelse(
        math_score > mean(math_score),
        0.8, 0.2
      )) |> 
      # create dependent variable as a binomial one
      # with prob as above
      mutate(gender = factor(
        rbinom(n = 1000, size = 1, prob = prob))) |> 
      select(-prob)
    
    # make the fit
    glm.fit <- glm(gender~., data = student, family = "binomial")
    
    summary(glm.fit)
    #> 
    #> Call:
    #> glm(formula = gender ~ ., family = "binomial", data = student)
    #> 
    #> Deviance Residuals: 
    #>     Min       1Q   Median       3Q      Max  
    #> -2.3264  -0.7876   0.3922   0.8381   2.3096  
    #> 
    #> Coefficients:
    #>                 Estimate Std. Error z value Pr(>|z|)    
    #> (Intercept)   -10.884194   0.877704 -12.401   <2e-16 ***
    #> math_score      0.120908   0.007850  15.403   <2e-16 ***
    #> reading_score   0.014737   0.006551   2.249   0.0245 *  
    #> lunchstandard   0.092162   0.152221   0.605   0.5449    
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> (Dispersion parameter for binomial family taken to be 1)
    #> 
    #>     Null deviance: 1386.2  on 999  degrees of freedom
    #> Residual deviance: 1045.7  on 996  degrees of freedom
    #> AIC: 1053.7
    #> 
    #> Number of Fisher Scoring iterations: 4
    

    The summary shows the expected result, with math_score being significant. The model can be plotted as follows.

    plot_model(
      glm.fit, 
      type = "pred", 
      terms = c("math_score", "reading_score", "lunch"), 
      colors = "bw",
      ci.lvl = NA
    )
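
    Since plot_model(type = "pred") delegates to the ggeffects package under the hood, an equivalent plot can be produced with ggpredict() directly. This is a sketch assuming the simulated student data and glm.fit model above; with up to three terms, the first is mapped to the x-axis, the second to colour groups, and the third to facets:

    ```r
    library(ggeffects)

    # marginal predictions: math_score on x, reading_score as
    # groups, lunch as facets
    ggpredict(glm.fit,
              terms = c("math_score", "reading_score", "lunch")) |>
      plot()
    ```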
    

    PS: Maybe it would be more instructive if we could use your real data, or a part of it. Please have a look at dput(), as people normally do not want to type your data in from a screenshot. Another good read is How to make a good minimal working example.