Search code examples
rlogistic-regressionglm

Default starting values fitting logistic regression with glm


I'm wondering how are default starting values specified in glm.

This post suggests that default values are set as zeros. This one says that there is an algorithm behind it, however relevant link is broken.

I tried to fit simple logistic regression model with algorithm trace:

set.seed(123)

x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)

# to see parameter estimates in each step
trace(glm.fit, quote(print(coefold)), at = list(c(22, 4, 8, 4, 19, 3)))

First, without specification of initial values:

glm(y ~ x, family = "binomial")

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
NULL
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.386379 1.106234
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3991135 1.1653971
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995188 1.1669508

In the first step, initial values are NULL.

Second, I set starting values to be zeros:

glm(y ~ x, family = "binomial", start = c(0, 0))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0 0
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3177530 0.9097521
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3909975 1.1397163
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3994147 1.1666173
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995191 1.1669518

And we can see that iterations between first and second approach differ.

To see initial values specified by glm I tried to fit model with only one iteration:

glm(y ~ x, family = "binomial", control = list(maxit = 1))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
NULL

Call:  glm(formula = y ~ x, family = "binomial", control = list(maxit = 1))

Coefficients:
(Intercept)            x  
     0.3864       1.1062  

Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Null Deviance:      134.6 
Residual Deviance: 115  AIC: 119

Estimates of parameters (not surprisingly) correspond to estimates of the first approach in the second iteration i.e., [1] 0.386379 1.106234 Setting these values as initial values leads to the same iterations sequence as in the first approach:

glm(y ~ x, family = "binomial", start = c(0.386379, 1.106234))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.386379 1.106234
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3991135 1.1653971
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995188 1.1669508

So the question is, how these values are calculated?


Solution

  • TL;DR

    • start=c(b0,b1) initializes eta to b0+x*b1 (mu to 1/(1+exp(-eta)))

    • start=c(0,0) initializes eta to 0 (mu to 0.5) regardless of y or x value.

    • start=NULL initializes eta= 1.098612 (mu=0.75) if y=1, regardless of x value.

    • start=NULL initializes eta=-1.098612 (mu=0.25) if y=0, regardless of x value.

    • Once eta (and consequently mu and var(mu)) has been calculated, w and z are calculated and sent to a QR solver, in the spirit of qr.solve(cbind(1,x) * w, z*w).

    Longform

    Building off Roland's comment: I made a glm.fit.truncated(), where I took glm.fit down to the C_Cdqrls call, and then commented it out. glm.fit.truncated outputs the z and w values (as well as the values of the quantities used to calculate z and w) which would then be passed to the C_Cdqrls call:

    ## call Fortran code via C wrapper
    fit <- .Call(C_Cdqrls, x[good, , drop = FALSE] * w, z * w,
                 min(1e-7, control$epsilon/1000), check=FALSE) 
    

    More can be read about C_Cdqrls here. Luckily, the function qr.solve in base R taps directly into the LINPACK versions being called upon in glm.fit().

    So we run glm.fit.truncated for the different starting value specifications, and then do a call to qr.solve with the w and z values, and we see how the "starting values" (or the first displayed iteration values) are calculated. As Roland indicated, specifying start=NULL or start=c(0,0) in glm() affects the calculations for w and z, not for start.

    For the start=NULL: z is a vector where the elements have the value 2.431946 or -2.431946 and w is a vector where all elements are 0.4330127:

    start.is.null <- glm.fit.truncated(x,y,family=binomial(), start=NULL)
    start.is.null
    w <- start.is.null$w
    z <- start.is.null$z
    ## if start is NULL, the first displayed values are:
    qr.solve(cbind(1,x) * w, z*w)  
    # > qr.solve(cbind(1,x) * w, z*w)  
    #                 x 
    # 0.386379 1.106234 
    

    For the start=c(0,0): z is a vector where the elements have the value 2 or -2 and w is a vector where all elements are 0.5:

    ## if start is c(0,0)    
    start.is.00 <- glm.fit.truncated(x,y,family=binomial(), start=0)
    start.is.00
    w <- start.is.00$w
    z <- start.is.00$z
    ## if start is c(0,0), the first displayed values are:    
    qr.solve(cbind(1,x) * w, z*w)  
    # > qr.solve(cbind(1,x) * w, z*w)  
    #                   x 
    # 0.3177530 0.9097521 
    

    So that's all well and good, but how do we calculate the w and z? Near the bottom of glm.fit.truncated() we see

    z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
    w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
    

    Look at the following comparisons between the outputted values of the quantities used to calculate z and w:

    cbind(y, start.is.null$mu, start.is.00$mu)
    cbind(y, start.is.null$eta, start.is.00$eta)
    cbind(start.is.null$var_mu, start.is.00$var_mu)
    cbind(start.is.null$mu.eta.val, start.is.00$mu.eta.val)
    

    Note that start.is.00 will have vector mu with only the values 0.5 because eta is set to 0 and mu(eta) = 1/(1+exp(-0))= 0.5. start.is.null sets those with y=1 to be mu=0.75 (which corresponds to eta=1.098612) and those with y=0 to be mu=0.25 (which corresponds to eta=-1.098612), and thus the var_mu = 0.75*0.25 = 0.1875.

    However, it is interesting to note, that I changed the seed and reran everything and the mu=0.75 for y=1 and mu=0.25 for y=0 (and thus the other quantities stayed the same). That is to say, start=NULL gives rise to the same w and z regardless of what y and x are, because they initialize eta=1.098612 (mu=0.75) if y=1 and eta=-1.098612 (mu=0.25) if y=0.

    So it appears that a starting value for the Intercept coefficient and for the X-coefficient is not set for start=NULL, but rather initial values are given to eta depending on the y-value and independent of the x-value. From there w and z are calculated, then sent along with x to the qr.solver.

    Code to run before the chunks above:

    set.seed(123)
    
    x <- rnorm(100)
    p <- 1/(1 + exp(-x))
    y <- rbinom(100, size = 1, prob = p)
    
    
    glm.fit.truncated <- function(x, y, weights = rep.int(1, nobs), 
    start = 0,etastart = NULL, mustart = NULL, 
    offset = rep.int(0, nobs),
    family = binomial(), 
    control = list(), 
    intercept = TRUE,
    singular.ok = TRUE
    ){
    control <- do.call("glm.control", control)
    x <- as.matrix(x)
    xnames <- dimnames(x)[[2L]]
    ynames <- if(is.matrix(y)) rownames(y) else names(y)
    conv <- FALSE
    nobs <- NROW(y)
    nvars <- ncol(x)
    EMPTY <- nvars == 0
    ## define weights and offset if needed
    if (is.null(weights))
      weights <- rep.int(1, nobs)
    if (is.null(offset))
      offset <- rep.int(0, nobs)
    
    ## get family functions:
    variance <- family$variance
    linkinv  <- family$linkinv
    if (!is.function(variance) || !is.function(linkinv) )
      stop("'family' argument seems not to be a valid family object", call. = FALSE)
    dev.resids <- family$dev.resids
    aic <- family$aic
    mu.eta <- family$mu.eta
    unless.null <- function(x, if.null) if(is.null(x)) if.null else x
    valideta <- unless.null(family$valideta, function(eta) TRUE)
    validmu  <- unless.null(family$validmu,  function(mu) TRUE)
    if(is.null(mustart)) {
      ## calculates mustart and may change y and weights and set n (!)
      eval(family$initialize)
    } else {
      mukeep <- mustart
      eval(family$initialize)
      mustart <- mukeep
    }
    if(EMPTY) {
      eta <- rep.int(0, nobs) + offset
      if (!valideta(eta))
        stop("invalid linear predictor values in empty model", call. = FALSE)
      mu <- linkinv(eta)
      ## calculate initial deviance and coefficient
      if (!validmu(mu))
        stop("invalid fitted means in empty model", call. = FALSE)
      dev <- sum(dev.resids(y, mu, weights))
      w <- sqrt((weights * mu.eta(eta)^2)/variance(mu))
      residuals <- (y - mu)/mu.eta(eta)
      good <- rep_len(TRUE, length(residuals))
      boundary <- conv <- TRUE
      coef <- numeric()
      iter <- 0L
    } else {
      coefold <- NULL
      eta <-
        if(!is.null(etastart)) etastart
      else if(!is.null(start))
        if (length(start) != nvars)
          stop(gettextf("length of 'start' should equal %d and correspond to initial coefs for %s", nvars, paste(deparse(xnames), collapse=", ")),
               domain = NA)
      else {
        coefold <- start
        offset + as.vector(if (NCOL(x) == 1L) x * start else x %*% start)
      }
      else family$linkfun(mustart)
      mu <- linkinv(eta)
      if (!(validmu(mu) && valideta(eta)))
        stop("cannot find valid starting values: please specify some", call. = FALSE)
      ## calculate initial deviance and coefficient
      devold <- sum(dev.resids(y, mu, weights))
      boundary <- conv <- FALSE
      
      ##------------- THE Iteratively Reweighting L.S. iteration -----------
      for (iter in 1L:control$maxit) {
        good <- weights > 0
        varmu <- variance(mu)[good]
        if (anyNA(varmu))
          stop("NAs in V(mu)")
        if (any(varmu == 0))
          stop("0s in V(mu)")
        mu.eta.val <- mu.eta(eta)
        if (any(is.na(mu.eta.val[good])))
          stop("NAs in d(mu)/d(eta)")
        ## drop observations for which w will be zero
        good <- (weights > 0) & (mu.eta.val != 0)
        
        if (all(!good)) {
          conv <- FALSE
          warning(gettextf("no observations informative at iteration %d",
                           iter), domain = NA)
          break
        }
        z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
        w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
        # ## call Fortran code via C wrapper
        # fit <- .Call(C_Cdqrls, x[good, , drop = FALSE] * w, z * w,
        #              min(1e-7, control$epsilon/1000), check=FALSE)
        # 
        
        #print(iter)
        #print(z)
        #print(w)
      }
    
      
      }
      return(list(z=z, w=w, mustart=mustart, etastart=etastart, eta=eta, offset=offset, mu=mu, mu.eta.val=mu.eta.val,
                  weight=weights, var_mu=variance(mu)))
    
    }