Tags: r, dplyr, purrr, tidy

Remove unused contrasts when making multiple linear models using R map


I am fitting linear models across a large, unbalanced dataset (not all contrasts are present in every grouping). Is there an efficient way to skip groupings that have fewer than two contrast levels? In the examples below, testData1 is a balanced dataset for which the workflow runs correctly. testData2 is an unbalanced dataset for which it throws a contrast error.

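library(dplyr)  # %>%, mutate
library(tidyr)  # nest, unnest_legacy

# Fit a linear model to one group and return its coefficient table
aovFxn <- function(dat){
  lm(outcomeVar ~ predVar1, data = dat) %>%
    broom::tidy()
}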

testData1 <- data.frame(
  groupVar = rep(c('a', 'b'), each = 12),
  predVar1 = c(rep(c('x', 'y', 'z'), each = 4, times = 2)),
  outcomeVar = sample(1:100, 24)
)

testData2 <- data.frame(
  groupVar = rep(c('a', 'b'), each = 12),
  predVar1 = c(rep(c('x', 'y', 'z'), each = 4),
               rep('x', 12)),
  outcomeVar = sample(1:100, 24)
)

testStats1 <- testData1 %>%
  nest(groupData = -groupVar) %>%
  mutate(df = purrr::map(groupData, aovFxn)) %>%
  unnest_legacy(df)

# Errors: group 'b' has a single predVar1 level, so lm() cannot build contrasts
testStats2 <- testData2 %>%
  nest(groupData = -groupVar) %>%
  mutate(df = purrr::map(groupData, aovFxn)) %>%
  unnest_legacy(df)

Solution

  • We can use either tryCatch or purrr::possibly to return a fallback value (here NULL) when the model fit throws an error; unnest() then drops the groups whose result is NULL.

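    library(dplyr)
    library(tidyr)  # nest() and unnest() come from tidyr
    library(purrr)
    # Wrap aovFxn so that it returns NULL instead of stopping on an error
    paovFxn <- possibly(aovFxn, otherwise = NULL)
    testData2 %>%
      nest(groupData = -groupVar) %>%
      mutate(df = purrr::map(groupData, paovFxn)) %>%
      unnest(df) %>%
      select(-groupData)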
    

    -output

    # A tibble: 3 × 6
      groupVar term        estimate std.error statistic p.value
      <chr>    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
    1 a        (Intercept)    42.5       17.3    2.45    0.0367
    2 a        predVar1y      19.7       24.5    0.805   0.441 
    3 a        predVar1z       2.25      24.5    0.0917  0.929 
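
The answer only shows the possibly version. A minimal sketch of the tryCatch route, assuming the same testData2 and aovFxn as in the question, might look like the following. The wrapper name safeAovFxn is hypothetical and not part of the original post.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Unbalanced data from the question: group 'b' only has level 'x'
testData2 <- data.frame(
  groupVar = rep(c('a', 'b'), each = 12),
  predVar1 = c(rep(c('x', 'y', 'z'), each = 4), rep('x', 12)),
  outcomeVar = sample(1:100, 24)
)

# Model function from the question
aovFxn <- function(dat){
  lm(outcomeVar ~ predVar1, data = dat) %>%
    broom::tidy()
}

# Hypothetical wrapper: return NULL instead of stopping when the fit errors
safeAovFxn <- function(dat){
  tryCatch(aovFxn(dat), error = function(e) NULL)
}

tryCatchStats <- testData2 %>%
  nest(groupData = -groupVar) %>%
  mutate(df = map(groupData, safeAovFxn)) %>%
  unnest(df) %>%       # NULL entries are dropped here
  select(-groupData)
```

As with possibly, this swallows every error raised inside aovFxn, not only the contrast error, so a genuinely broken group would also be dropped silently.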
    

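    Another option is to use an if condition that fits the model only when predVar1 has more than one distinct value in the group; otherwise the result is NULL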

    testData2 %>% 
      nest(groupData = -groupVar) %>% 
      mutate(df = map(groupData, ~ if(n_distinct(.x$predVar1) > 1) aovFxn(.x)) ) %>% 
      unnest(df, keep_empty = TRUE) %>%
      select(-groupData)
    

    -output

    # A tibble: 4 × 6
      groupVar term        estimate std.error statistic p.value
      <chr>    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
    1 a        (Intercept)    42.5       17.3    2.45    0.0367
    2 a        predVar1y      19.7       24.5    0.805   0.441 
    3 a        predVar1z       2.25      24.5    0.0917  0.929 
    4 b        <NA>           NA         NA     NA      NA     
    

    NOTE: keep_empty defaults to FALSE. Without keep_empty = TRUE, the row for groupVar 'b' is dropped from the output.
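
To make the effect of keep_empty concrete, here is a small sketch that runs the if-condition approach once and unnests it both ways. It assumes the same testData2 and aovFxn as in the question; the names nested, dropped, and kept are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Unbalanced data from the question: group 'b' only has level 'x'
testData2 <- data.frame(
  groupVar = rep(c('a', 'b'), each = 12),
  predVar1 = c(rep(c('x', 'y', 'z'), each = 4), rep('x', 12)),
  outcomeVar = sample(1:100, 24)
)

# Model function from the question
aovFxn <- function(dat){
  lm(outcomeVar ~ predVar1, data = dat) %>%
    broom::tidy()
}

# Fit only where predVar1 has more than one level; other groups get NULL
nested <- testData2 %>%
  nest(groupData = -groupVar) %>%
  mutate(df = map(groupData, ~ if (n_distinct(.x$predVar1) > 1) aovFxn(.x)))

# Default (keep_empty = FALSE): group 'b' disappears
dropped <- nested %>% unnest(df) %>% select(-groupData)

# keep_empty = TRUE: group 'b' is kept as a single row of NAs
kept <- nested %>% unnest(df, keep_empty = TRUE) %>% select(-groupData)

nrow(dropped)  # 3
nrow(kept)     # 4
```

Keeping the empty row is useful when the downstream report should list every group, including those that could not be modelled.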