How to do logistic regression on summary data in R?


So I have some data that is structured similarly to the following:

          | Works | DoesNotWork |
          -----------------------
Unmarried | 130   | 235         |
Married   | 10    | 95          |

I'm trying to use logistic regression to predict work status from marriage status, but I don't understand how to do this in R. For example, if my data looked like the following:

MarriageStatus  | WorkStatus| 
-----------------------------
Married         | No        |
Married         | No        |
Married         | Yes       |
Unmarried       | No        |
Unmarried       | Yes       |
Unmarried       | Yes       |

I understand that I could do the following:

log_model <- glm(WorkStatus ~ MarriageStatus, data=MarriageDF, family=binomial(logit))

When the data is summarized, though, I don't understand how to do this. Do I need to expand the data into a non-summarized form, encoding Married/Unmarried as 0/1 and Working/Not Working as 0/1?

Given only the first summary DF, how would I write the logistic regression glm function? Something like this?

log_summary_model <- glm(Works ~ DoesNotWork, data=summaryDF, family=binomial(logit))

But that doesn't make sense, as I'd be splitting the response (dependent) variable across both sides of the formula?

I'm not sure if I'm overcomplicating this. Any help would be greatly appreciated, thanks!


Solution

  • You need to reshape the contingency table into a long data frame with one row per combination of outcome and predictor, plus a column holding the frequency count. A logit model can then be fitted using that count as a weight variable:

    mod <- glm(works ~ marriage, df, family = binomial, weights = freq)
    summary(mod) 
    
    Call:
    glm(formula = works ~ marriage, family = binomial, data = df, 
        weights = freq)
    
    Deviance Residuals: 
          1        2        3        4  
     16.383    6.858  -14.386   -4.361  
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  -0.5921     0.1093  -5.416 6.08e-08 ***
    marriage     -1.6592     0.3500  -4.741 2.12e-06 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 572.51  on 3  degrees of freedom
    Residual deviance: 541.40  on 2  degrees of freedom
    AIC: 545.4
    
    Number of Fisher Scoring iterations: 5
    

    Data (run this first so that df exists before fitting the model):

    df <- read.table(text = "works marriage freq
                     1 0 130
                     1 1 10
                     0 0 235
                     0 1 95", header = TRUE)
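
As a hedged alternative to the weighted fit above, here is a minimal sketch of two other ways to get the same coefficients. The first keeps the table in its original wide form and passes a two-column response, `cbind(successes, failures)`, which `glm()` accepts for the binomial family. The second answers the "do I need to expand the data?" part of the question by repeating each row `freq` times and fitting an ordinary 0/1 model. The names `summaryDF`, `long`, `expanded`, `log_summary_model` and `mod_expanded` are assumptions chosen for illustration, and the factor levels are ordered so that Unmarried is the reference group, matching the 0/1 coding used above.

```r
# Wide summary table, one row per marriage group.
# Column names are assumed to match the table in the question.
summaryDF <- data.frame(
  MarriageStatus = factor(c("Unmarried", "Married"),
                          levels = c("Unmarried", "Married")),
  Works          = c(130, 10),
  DoesNotWork    = c(235, 95)
)

# Two-column response: successes first, failures second.
log_summary_model <- glm(cbind(Works, DoesNotWork) ~ MarriageStatus,
                         data = summaryDF, family = binomial)
coef(log_summary_model)

# Hand check: the intercept is the log-odds of working for the unmarried
# group, and the slope is the difference in log-odds between the groups.
log_odds_unmarried <- log(130 / 235)
log_odds_married   <- log(10 / 95)
c(intercept = log_odds_unmarried,
  slope     = log_odds_married - log_odds_unmarried)

# Optional: expand the long table to one row per person and fit without weights.
long <- data.frame(works    = c(1, 1, 0, 0),
                   marriage = c(0, 1, 0, 1),
                   freq     = c(130, 10, 235, 95))
expanded <- long[rep(seq_len(nrow(long)), long$freq), c("works", "marriage")]
mod_expanded <- glm(works ~ marriage, data = expanded, family = binomial)
coef(mod_expanded)
```

All three fits should agree on the estimates and standard errors. What differs is the reported deviance and degrees of freedom: the two-row `cbind` model is saturated (two parameters for two observations), while the weighted and expanded versions count each row or each person as an observation. So the deviance lines in `summary()` are not comparable across the three formats.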