Search code examples
rlogistic-regressionglmnetlasso-regression

Aggregated logistic lasso regression in glmnet


In glm() it is possible to model bernoulli [0,1] outcomes with a logistic regression using the following sort of syntax.

glm(bin ~ x, df, family = "binomial")

However you can also perform aggregated binomial regression, where each observation represents a count of target events from a certain fixed number of bernoulli trials. For example see the following data:

set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
p <- 0.4 + 0.2*x
y <- rbinom(n, cov, p)

With these sort of data you use slightly different syntax in glm()

mod <- glm(cbind(y, cov-y) ~ x, family="binomial")
mod

# output

# Call:  glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
# 
# Coefficients:
#   (Intercept)            x  
# -0.3064       0.6786  
# 
# Degrees of Freedom: 49 Total (i.e. Null);  48 Residual
# Null Deviance:        53.72 
# Residual Deviance: 39.54  AIC: 178

I was wondering is it possible to model this type of aggregated binomial data in the glmnet package? If so, what is the syntax?


Solution

  • Yes you can do it as the following

    set.seed(1)
    n <- 50
    cov <- 10
    x <- c(rep(0,n/2), rep(1, n/2))
    x = cbind(x, xx = c(rep(0.5,20), rep(0.7, 20), rep(1,10)))
    p <- 0.4 + 0.2*x
    y <- rbinom(n, cov, p)
    

    I added another covariate here called xx as glmnet accepts minimum of two covariates

    In glm as you have it in your post

    mod <- glm(cbind(y, cov-y) ~ x, family="binomial")
    mod
    
    # output
    # Call:  glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
    
    # Coefficients:
    # (Intercept)           xx          xxx  
    # 0.04366      0.86126     -0.64862  
    
    # Degrees of Freedom: 49 Total (i.e. Null);  47 Residual
    # Null Deviance:        53.72 
    # Residual Deviance: 38.82  AIC: 179.3
    

    In glmnet, without regularization (lambda=0) to reproduce similar results as in glm

    library(glmnet)
    fit = glmnet(x, cbind(cov-y,y), family="binomial", lambda=0)
    coef(fit)
    # output
    # 3 x 1 sparse Matrix of class "dgCMatrix"
    #                     s0
    # (Intercept)  0.04352689
    # x            0.86111234
    # xx          -0.64831806