Search code examples
rlogistic-regressionbinary-data

How Do You Plot a Probit Regression in R?


I am new to R so this may seem like a basic question; I am trying to estimate a probit regression of being Employed given the individual is a Male. I think I have the probit model correct but I am unable to plot it. Below is the first 10 rows from my dataset which has 60,000 in total. As you can see I have created 5 dummy variables: 'Male', 'LeavingCert', 'Bachelors', 'Married', and 'Employed'; (Although the first 10 rows in the Employed column are 0, this is not the case for the full dataset. However; there are significantly more 0s than 1s and perhaps this is my issue?)

Top10 <- head(data,10)
Top10

# A tibble: 10 × 10
    ...1   SEX MARSTAT MAINSTAT EDUCLEV4  Male LeavingCert Bachelors Married Employed
   <dbl> <dbl>   <dbl>    <dbl>    <dbl> <dbl>       <dbl>     <dbl>   <dbl>    <dbl>
 1     1     2       2        6        9     0           0         0       1        0
 2     2     2       1        2        9     0           0         0       0        0
 3     3     2       1        2        9     0           0         0       0        0
 4     4     2       2        2        9     0           0         0       1        0
 5     5     2       1        5        6     0           0         1       0        0
 6     6     1       2        3        9     1           0         0       1        0
 7     7     1       2        2        9     1           0         0       1        0
 8     8     2       3        2        9     0           0         0       0        0
 9     9     2       2        2        9     0           0         0       1        0
10    10     1       1        4        9     1           0         0       0        0

For my 'Probit1' model, my Y is 'Employed' and my X is 'Male'. My code is as follows;

Probit1 <- glm(Employed ~ Male, 
               family = binomial(link = "probit"), 
               data = data)
summary(Probit1)

I have tried to plot this probit regression as follows;

# plot data
plot(x = data$Male, 
     y = data$Employed,
     main = "Probit Model of the Probability of Employed, Given Male",
     xlab = "Male",
     ylab = "Employed",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.85)

# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Empolyed")
text(2.5, -0.1, cex= 0.8, "Unemployed")

# add estimated regression line
x <- seq(0, 3, 0.01)
y <- predict(Probit1, list(Male = x), type = "response")

lines(x, y, lwd = 1.5, col = "steelblue")

This is the plot I am getting and it does not seem correct?

Probit1 Regression Plot

Are my data the problem?

Any help is very much appreciated and if this is not possible, is there another plot I could do? Thanks in advance.

summary(Probit1)

> summary(Probit1)

Call:
glm(formula = Employed ~ Male, family = binomial(link = "probit"), 
    data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.2777  -0.2777  -0.2742  -0.2742   2.5689  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.78787    0.01299 -137.67   <2e-16 ***
Male         0.01141    0.01871    0.61    0.542    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19747  on 61939  degrees of freedom
Residual deviance: 19747  on 61938  degrees of freedom
AIC: 19751

Number of Fisher Scoring iterations: 5

table(data$Male, data$Employed)

> table(data$Male, data$Employed)
   
        0     1
  0 31165  1194
  1 28462  1119

Solution

  • Thank you for adding the regression summary and the table. From the regression summary we can see, that the coefficient for male is really small (and not significant):

    Male         0.01141    0.01871    0.61    0.542 
    

    That small coefficient means, that being male increases the chances of being employed only the tiniest bit so it makes sense, that we cannot see it in the plot, because the change is too small, the slope to little.

    Looking at the table

     > table(data$Male, data$Employed)
       
            0     1
      0 31165  1194
      1 28462  1119
    

    We can confirm, that there is no numerically or visually impressive increase: 1194/(1194+31165) = 3,7%, 1119/(1119+28462) = 3,8%. You cannot expect to see a .1% increase in your plot.

    Are my data the problem?

    They are the reason. I don't know, whether they are a problem.