Search code examples
rglmconfidence-interval

Extract confidence interval for both values of binary variable for glm()?


I want to analyze the relation between whether someone smoked or not and the number of drinks of alcohol.

The reproducible data set:

smoking_status alcohol_drinks
1 2
0 5
1 2
0 1
1 0
1 0
0 0
1 9
1 6
1 5

I have used glm() to analyse this relation:

glm <- glm(smoking_status ~ alcohol_drinks, data = data, family = binomial)
summary(glm)
confint(glm)

Using the above I'm able to extract the p-value and the confidence interval for the entire set.

However, I would like to extract the confidence interval for each smoking status, so that I can produce this results table:

Alcohol drinks, mean (95%CI) p-values
Smokers X (X - X) 0.492
Non-smokers X (X - X)

How can I produce this?


Solution

  • First of all, the response alcohol_drinks is not binary, a logistic regression is out of the question. Since the response is counts data, I will fit a Poisson model.

    To have confidence intervals for each binary value of smoking_status, coerce to factor and fit a model without an intercept.

    x <- 'smoking_status  alcohol_drinks
    1   2
    0   5
    1   2
    0   1
    1   0
    1   0
    0   0
    1   9
    1   6
    1   5'
    df1 <- read.table(textConnection(x), header = TRUE)
    
    
    pois_fit <- glm(alcohol_drinks ~ 0 + factor(smoking_status), data = df1, family = poisson(link = "log"))
    summary(pois_fit)
    #> 
    #> Call:
    #> glm(formula = alcohol_drinks ~ 0 + factor(smoking_status), family = poisson(link = "log"), 
    #>     data = df1)
    #> 
    #> Deviance Residuals: 
    #>     Min       1Q   Median       3Q      Max  
    #> -2.6186  -1.7093  -0.8104   1.1389   2.4957  
    #> 
    #> Coefficients:
    #>                         Estimate Std. Error z value Pr(>|z|)    
    #> factor(smoking_status)0   0.6931     0.4082   1.698   0.0895 .  
    #> factor(smoking_status)1   1.2321     0.2041   6.036 1.58e-09 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> (Dispersion parameter for poisson family taken to be 1)
    #> 
    #>     Null deviance: 58.785  on 10  degrees of freedom
    #> Residual deviance: 31.324  on  8  degrees of freedom
    #> AIC: 57.224
    #> 
    #> Number of Fisher Scoring iterations: 5
    confint(pois_fit)
    #> Waiting for profiling to be done...
    #>                              2.5 %   97.5 %
    #> factor(smoking_status)0 -0.2295933 1.399304
    #> factor(smoking_status)1  0.8034829 1.607200
    #>
    exp(confint(pois_fit))
    #> Waiting for profiling to be done...
    #>                             2.5 %   97.5 %
    #> factor(smoking_status)0 0.7948568 4.052378
    #> factor(smoking_status)1 2.2333058 4.988822
    

    Created on 2022-06-04 by the reprex package (v2.0.1)


    Edit

    The edit to the question states that the problem was reversed, what is asked is to find out the effect of alcohol drinking on smoking status. And with a binary response, individuals can be smokers or not, a logistic regression is a possible model.

    bin_fit <- glm(smoking_status ~ alcohol_drinks, data = df1, family = binomial(link = "logit"))
    summary(bin_fit)
    #> 
    #> Call:
    #> glm(formula = smoking_status ~ alcohol_drinks, family = binomial(link = "logit"), 
    #>     data = df1)
    #> 
    #> Deviance Residuals: 
    #>     Min       1Q   Median       3Q      Max  
    #> -1.7491  -0.8722   0.6705   0.8896   1.0339  
    #> 
    #> Coefficients:
    #>                Estimate Std. Error z value Pr(>|z|)
    #> (Intercept)      0.3474     0.9513   0.365    0.715
    #> alcohol_drinks   0.1877     0.2730   0.687    0.492
    #> 
    #> (Dispersion parameter for binomial family taken to be 1)
    #> 
    #>     Null deviance: 12.217  on 9  degrees of freedom
    #> Residual deviance: 11.682  on 8  degrees of freedom
    #> AIC: 15.682
    #> 
    #> Number of Fisher Scoring iterations: 4
    
    # Odds ratios
    exp(coef(bin_fit))
    #>    (Intercept) alcohol_drinks 
    #>       1.415412       1.206413
    exp(confint(bin_fit))
    #> Waiting for profiling to be done...
    #>                    2.5 %    97.5 %
    #> (Intercept)    0.2146432 11.167555
    #> alcohol_drinks 0.7464740  2.417211
    

    Created on 2022-06-05 by the reprex package (v2.0.1)


    Another way to conduct a logistic regression is to regress the cumulative counts of smokers on increasing numbers of alcoholic drinks. In order to do this, the data must be sorted by alcohol_drinks, so I will create a second data set, df2. Code inspired this in this RPubs post.

    df2 <- df1[order(df1$alcohol_drinks), ]
    Total <- sum(df2$smoking_status)
    df2$smoking_status <- cumsum(df2$smoking_status)
    
    fit <- glm(cbind(smoking_status, Total - smoking_status) ~ alcohol_drinks, data = df2, family = binomial())
    summary(fit)
    #> 
    #> Call:
    #> glm(formula = cbind(smoking_status, Total - smoking_status) ~ 
    #>     alcohol_drinks, family = binomial(), data = df2)
    #> 
    #> Deviance Residuals: 
    #>     Min       1Q   Median       3Q      Max  
    #> -0.9714  -0.2152   0.1369   0.2942   0.8975  
    #> 
    #> Coefficients:
    #>                Estimate Std. Error z value Pr(>|z|)    
    #> (Intercept)     -1.1671     0.3988  -2.927 0.003428 ** 
    #> alcohol_drinks   0.4437     0.1168   3.798 0.000146 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> (Dispersion parameter for binomial family taken to be 1)
    #> 
    #>     Null deviance: 23.3150  on 9  degrees of freedom
    #> Residual deviance:  3.0294  on 8  degrees of freedom
    #> AIC: 27.226
    #> 
    #> Number of Fisher Scoring iterations: 4
    # Odds ratios
    exp(coef(fit))
    #>    (Intercept) alcohol_drinks 
    #>      0.3112572      1.5584905
    exp(confint(fit))
    #> Waiting for profiling to be done...
    #>                    2.5 %    97.5 %
    #> (Intercept)    0.1355188 0.6569898
    #> alcohol_drinks 1.2629254 2.0053079
    
    plot(smoking_status/Total ~ alcohol_drinks, 
         data = df2,
         xlab = "Alcoholic Drinks",
         ylab = "Proportion of Smokers")
    lines(df2$alcohol_drinks, fit$fitted, type="l", col="red")
    title(main = "Alcohol and Smoking")
    

    Created on 2022-06-05 by the reprex package (v2.0.1)