Search code examples
rnapurrrglm

Extract NA Values in Summary Output of glm


I am running a logistic regression on many different subsets of a large data frame. To do so, I use the follwoing code (using dplyrand purrr):

# define model to be run
mod_fun <- function(df) {
  glm(presence ~ transect, data = df, family = "binomial")
}

# nest data and run model
mod.glm <- dat %>%
  nest(-c(region, fYear, species, road)) %>%
  mutate(model = map(data, mod_fun))

# define functions to extract model coefficients
b_fun <- function(mod) {
  coef(summary(mod))[2]
}
p_fun <- function(mod) {
  coef(summary(mod))[8]
}

# extract coefficients
slope<-mod.glm %>% group_by(species, region, fYear, road) %>%
  transmute(beta = map_dbl(model, b_fun),
            p_val = map_dbl(model, p_fun))

As you can see, I only want to extract the estimate and p-value of the slope (called transect). To do so, I use the indexing coef(summary(mod))[2]etc. The problem is that there are also some subsets in my data frame resulting in over-determined systems where some coefficients will be set to NA. Using coef(summary(mod))[2] extracts the 2nd value of the coef()output and as NAs are ignored in coef(), this will no longer be the estimate of transect I want to extract. So far I tried unsing coef(summary(mod_2), complete = TRUE)(--> nothing changes, NAs still not showed) and adressing the values directly coef(summary(mod_2), complete = TRUE)["transect","Estimate"](--> throws an error). Does anyone know how I could solve this issue?

What I tried so far:

# two example models; mod_2 will result in NAs
mod_1 <- glm(presence ~ transect, data = dat[dat$fYear == 1&  dat$species=="Plantago lanceolata",], family = "binomial")
mod_2 <- glm(presence ~ transect, data = dat[dat$fYear == 2&  dat$species=="Plantago lanceolata",], family = "binomial")

coef(summary(mod_1))[2] # works fine
coef(summary(mod_2))[2] # not the value I want

coef(summary(mod_1), complete = TRUE)["transect","Estimate"] # works fine
coef(summary(mod_2), complete = TRUE)["transect","Estimate"] # error

coef(summary(mod_2), complete = TRUE) # NAs for transect are still not displayed

summary(mod_2)$coefficients["transect","Estimate"] # is not working either

Data:

dput(dat)
structure(list(region = c("HWI", "HWI", "HWI", "HWI", "HWI", 
"HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", 
"HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", "HWI", 
"HWI", "HWI", "HWI", "HWI", "HWI", "HWI"), road = c("MK", "MK", 
"MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", 
"MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", "MK", 
"MK", "MK", "MK", "MK", "MK"), transect = c(1L, 1L, 2L, 2L, 3L, 
3L, 4L, 4L, 4L, 4L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 
11L, 11L, 12L, 12L, 13L, 13L, 15L, 15L, 1L), fYear = c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), species = c("Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata", 
"Poa pratensis", "Plantago lanceolata", "Poa pratensis", "Plantago lanceolata"
), presence = c(1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -29L), groups = structure(list(
    fYear = c(1L, 1L, 2L), road = c("MK", "MK", "MK"), species = c("Plantago lanceolata", 
    "Poa pratensis", "Plantago lanceolata"), .rows = list(c(1L, 
    3L, 5L, 7L, 9L, 11L, 13L, 15L, 17L, 19L, 21L, 23L, 25L, 27L
    ), c(2L, 4L, 6L, 8L, 10L, 12L, 14L, 16L, 18L, 20L, 22L, 24L, 
    26L, 28L), 29L)), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE))

Thanks for your help!


Solution

  • I don't see how to extract the NA rows from the coefficients table. The answer, instead, may be to add in the NA rows after extracting the elements you want. This can be done with complete().

    In this example I use broom::tidy() because it makes it easy to get coefficients, a measure of uncertainty, and results of tests. This means I have to filter out the intercepts after the fact, but you could certainly do something similar with your functions if you add a column to complete() on.

    library(purrr)
    library(tidyr)
    library(dplyr)
    
    mod.glm %>%
         group_by(species, region, fYear, road) %>%
         transmute(results = map(model, broom::tidy) ) %>%
         unnest(results) %>%
         complete(term = "transect") %>%
         filter(term != "(Intercept)") %>%
         ungroup()
    
    # A tibble: 3 x 9
      species          region fYear road  term    estimate std.error statistic p.value
      <chr>            <chr>  <int> <chr> <chr>      <dbl>     <dbl>     <dbl>   <dbl>
    1 Plantago lanceo~ HWI        1 MK    transe~  -42.7   31127.     -0.00137   0.999
    2 Plantago lanceo~ HWI        2 MK    transe~   NA        NA      NA        NA    
    3 Poa pratensis    HWI        1 MK    transe~   -0.206     0.188  -1.10      0.272