Extract confidence interval for both values of binary variable for glm()?

I want to analyze the relation between whether someone smoked or not and the number of drinks of alcohol.

The reproducible data set:

smoking_status	alcohol_drinks
1	2
0	5
1	2
0	1
1	0
1	0
0	0
1	9
1	6
1	5

I have used glm() to analyse this relation:

glm <- glm(smoking_status ~ alcohol_drinks, data = data, family = binomial)
summary(glm)
confint(glm)

Using the above I'm able to extract the p-value and the confidence interval for the entire set.

However, I would like to extract the confidence interval for each smoking status, so that I can produce this results table:

	Alcohol drinks, mean (95%CI)	p-values
Smokers	X (X - X)	0.492
Non-smokers	X (X - X)

How can I produce this?

Solution

First of all, the response alcohol_drinks is not binary, a logistic regression is out of the question. Since the response is counts data, I will fit a Poisson model.

To have confidence intervals for each binary value of smoking_status, coerce to factor and fit a model without an intercept.

x <- 'smoking_status  alcohol_drinks
1   2
0   5
1   2
0   1
1   0
1   0
0   0
1   9
1   6
1   5'
df1 <- read.table(textConnection(x), header = TRUE)


pois_fit <- glm(alcohol_drinks ~ 0 + factor(smoking_status), data = df1, family = poisson(link = "log"))
summary(pois_fit)
#> 
#> Call:
#> glm(formula = alcohol_drinks ~ 0 + factor(smoking_status), family = poisson(link = "log"), 
#>     data = df1)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6186  -1.7093  -0.8104   1.1389   2.4957  
#> 
#> Coefficients:
#>                         Estimate Std. Error z value Pr(>|z|)    
#> factor(smoking_status)0   0.6931     0.4082   1.698   0.0895 .  
#> factor(smoking_status)1   1.2321     0.2041   6.036 1.58e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for poisson family taken to be 1)
#> 
#>     Null deviance: 58.785  on 10  degrees of freedom
#> Residual deviance: 31.324  on  8  degrees of freedom
#> AIC: 57.224
#> 
#> Number of Fisher Scoring iterations: 5
confint(pois_fit)
#> Waiting for profiling to be done...
#>                              2.5 %   97.5 %
#> factor(smoking_status)0 -0.2295933 1.399304
#> factor(smoking_status)1  0.8034829 1.607200
#>
exp(confint(pois_fit))
#> Waiting for profiling to be done...
#>                             2.5 %   97.5 %
#> factor(smoking_status)0 0.7948568 4.052378
#> factor(smoking_status)1 2.2333058 4.988822

^{Created on 2022-06-04 by the reprex package (v2.0.1)}

Edit

The edit to the question states that the problem was reversed, what is asked is to find out the effect of alcohol drinking on smoking status. And with a binary response, individuals can be smokers or not, a logistic regression is a possible model.

bin_fit <- glm(smoking_status ~ alcohol_drinks, data = df1, family = binomial(link = "logit"))
summary(bin_fit)
#> 
#> Call:
#> glm(formula = smoking_status ~ alcohol_drinks, family = binomial(link = "logit"), 
#>     data = df1)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.7491  -0.8722   0.6705   0.8896   1.0339  
#> 
#> Coefficients:
#>                Estimate Std. Error z value Pr(>|z|)
#> (Intercept)      0.3474     0.9513   0.365    0.715
#> alcohol_drinks   0.1877     0.2730   0.687    0.492
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 12.217  on 9  degrees of freedom
#> Residual deviance: 11.682  on 8  degrees of freedom
#> AIC: 15.682
#> 
#> Number of Fisher Scoring iterations: 4

# Odds ratios
exp(coef(bin_fit))
#>    (Intercept) alcohol_drinks 
#>       1.415412       1.206413
exp(confint(bin_fit))
#> Waiting for profiling to be done...
#>                    2.5 %    97.5 %
#> (Intercept)    0.2146432 11.167555
#> alcohol_drinks 0.7464740  2.417211

^{Created on 2022-06-05 by the reprex package (v2.0.1)}

Another way to conduct a logistic regression is to regress the cumulative counts of smokers on increasing numbers of alcoholic drinks. In order to do this, the data must be sorted by alcohol_drinks, so I will create a second data set, df2. Code inspired this in this RPubs post.

df2 <- df1[order(df1$alcohol_drinks), ]
Total <- sum(df2$smoking_status)
df2$smoking_status <- cumsum(df2$smoking_status)

fit <- glm(cbind(smoking_status, Total - smoking_status) ~ alcohol_drinks, data = df2, family = binomial())
summary(fit)
#> 
#> Call:
#> glm(formula = cbind(smoking_status, Total - smoking_status) ~ 
#>     alcohol_drinks, family = binomial(), data = df2)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -0.9714  -0.2152   0.1369   0.2942   0.8975  
#> 
#> Coefficients:
#>                Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)     -1.1671     0.3988  -2.927 0.003428 ** 
#> alcohol_drinks   0.4437     0.1168   3.798 0.000146 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 23.3150  on 9  degrees of freedom
#> Residual deviance:  3.0294  on 8  degrees of freedom
#> AIC: 27.226
#> 
#> Number of Fisher Scoring iterations: 4
# Odds ratios
exp(coef(fit))
#>    (Intercept) alcohol_drinks 
#>      0.3112572      1.5584905
exp(confint(fit))
#> Waiting for profiling to be done...
#>                    2.5 %    97.5 %
#> (Intercept)    0.1355188 0.6569898
#> alcohol_drinks 1.2629254 2.0053079

plot(smoking_status/Total ~ alcohol_drinks, 
     data = df2,
     xlab = "Alcoholic Drinks",
     ylab = "Proportion of Smokers")
lines(df2$alcohol_drinks, fit$fitted, type="l", col="red")
title(main = "Alcohol and Smoking")

^{Created on 2022-06-05 by the reprex package (v2.0.1)}