Search code examples
rsubsetlogistic-regressionglm

Removing completely separated observations from glm()


I'm doing a bit of exploratory data analysis using HMDA data from the AER package; however, the variables that I used to fit the model seem to contain some observations that perfectly determine the outcomes, an issue known as "separation." So I tried to remedy this using the solution recommended by this thread, yet when I tried to execute the first set of source code from glm.fit(), R returned an error message:

Error in family$family : object of type 'closure' is not subsettable

so I could not proceed any further to remove those fully determined observations from my data with this code. I am wondering if anyone could help me fix this?

My current code is provided at below for your reference.

# load the AER package and HMDA data
library(AER)
data(HMDA)

# fit a 2-degree olynomial probit model 
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial, data = HMDA)

# using the revised source code from that stackexchage thread to find out observations that received a warning message
library(tidyverse)
library(dplyr)
library(broom)

eps <- 10 * .Machine$double.eps
if (family$family == "binomial") {
  if (any(mu > 1 - eps) || any(mu < eps)) 
    warning("glm.fit: fitted probabilities numerically 0 or 1 occurred", 
            call. = FALSE)
}

# this return the following error message
# Error in family$family : object of type 'closure' is not subsettable


probit.resids <- augment(probit.fit) %>%
  mutate(p = 1 / (1 + exp(-.fitted)),
         warning = p > 1-eps)

arrange(probit.resids, desc(.fitted)) %>%  
  select(2:5, p, warning) %>% 
  slice(1:10)


HMDA.nwarning <- filter(HMDA, !probit.resids$warning)

# using HMDA.nwarning should solve the problem...
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial, data = HMDA.nwarning)


Solution

  • This chunk of code

    if (family$family == "binomial") {
      if (any(mu > 1 - eps) || any(mu < eps)) 
        warning("glm.fit: fitted probabilities numerically 0 or 1 occurred", 
                call. = FALSE)
    }
    

    there is a function, binomial() called when you run glm with family == "binomial". If you look under glm (just type glm):

    if (is.character(family)) 
            family <- get(family, mode = "function", envir = parent.frame())
        if (is.function(family)) 
            family <- family()
        if (is.null(family$family)) {
            print(family)
            stop("'family' not recognized")
        }
    

    And the glm function checks binomial()$family during the fit, and if any of the predicted values differ from 1 or 0 by eps, it raises that warning.

    You don't need to run that part, and yes, you need to set eps <- 10 * .Machine$double.eps . So let's run the code below, and if you run a probit, you need to specify link="probit" in binomial, otherwise the default is logit:

    library(AER)
    library(tidyverse)
    library(dplyr)
    library(broom)
    
    data(HMDA)
    
    probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial(link="probit"), data = HMDA)
    
    eps <- 10 * .Machine$double.eps
    
    probit.resids <- augment(probit.fit) %>%
      mutate(p = 1 / (1 + exp(-.fitted)),
             warning = p > 1-eps)
    

    The column warning indicates if the observations raises a warning, in this dataset, there's one:

    table(probit.resids$warning)
    
    FALSE  TRUE 
     2379     1
    

    We can use the next step to filter it

    HMDA.nwarning <- filter(HMDA, !probit.resids$warning)
    dim(HMDA.nwarning)
    [1] 2379   14
    

    And rerun the regression:

    probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial(link="probit"), data = HMDA.nwarning)
    coefficients(probit.fit)
    (Intercept) poly(hirat, 2)1 poly(hirat, 2)2 
          -1.191292        8.708494        6.884404