Tags: r, dataframe, dplyr, na, summarize

How to handle groups with NA values when using map_dfr and summarize


I am trying to calculate slopes for multiple measures in a grouped dataset using lm() in R. However, some groups have all NA values for certain measures, which causes the following error:

Error in `map()`:
i In index: 2.
Caused by error in `summarize()`:
i In argument: `Slope = (lm(reformulate("nDay", measure)))$coefficients[2]`.
i In group 5: `Subject = 3`, `Response = x`.
Caused by error in `lm.fit()`:
! 0 (non-NA) cases
Run `rlang::last_trace()` to see where the error occurred.

I understand that this error occurs because some groups have all NA values, making it impossible to fit a linear model. However, I haven't been able to figure out how to handle these cases by returning NA for the slope instead of crashing.

Here is a minimal working example of my code:

library(dplyr)
library(tidyr)
library(purrr)

calculate_slope = function(df, measure) {
  df %>%
    summarize(Measure = measure,
              Slope = (lm(reformulate("nDay", measure)))$coefficients[2],
              .groups = "drop")
}

example_data = expand.grid(
                 Subject = 1:3,
                 Response = c("x", "y"),
                 nDay = 1:3
               ) %>%
               mutate(
                 A = runif(n(), 0, 1),
                 B = runif(n(), 0, 1),
                 C = runif(n(), 0, 1)
               )

# Set some values to NA
example_data = example_data %>%
  mutate(B = ifelse(Subject == 3, NA, B))

measures = c("A", "B", "C")
summary_data = map_dfr(measures, ~ example_data %>%
                         group_by(Subject, Response) %>%
                         calculate_slope(., .x)) %>%
               pivot_wider(names_from = Measure, values_from = Slope) %>%
               rename_with(~ paste0("slope_", .), -c(Subject, Response))

I have tried modifying the calculate_slope function to check for all-NA values, but I cannot make it work: the grouping isn't preserved, so I get the same value for every group within a measure. The goal is to calculate the slopes for each measure (A, B, C) grouped by Subject and Response, and to return NA for the slope when a group has all NA values and a model can't be fit.


Solution

  • You can wrap the lm() call in the condition-handling function tryCatch() and set error = function(e) NA_real_, so that the slope is NA whenever lm() throws an error.

    calculate_slope = function(df, measure) {
      df %>%
        summarize(Measure = measure,
                  # Return a numeric NA if the model cannot be fit for this group
                  Slope = tryCatch(lm(reformulate("nDay", measure))$coefficients[2],
                                   error = function(e) NA_real_),
                  .groups = "drop")
    }
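
    To see the mechanism in isolation, here is a minimal sketch on a made-up data frame `d` (not part of the question's data) whose response is entirely NA. The handler returning NA_real_ is an assumption of this sketch, chosen so the result stays numeric.

    ```r
    # Made-up data for illustration: the response y is entirely NA.
    d <- data.frame(x = 1:3, y = NA_real_)

    # lm() drops incomplete rows, finds nothing left to fit, and signals an
    # error; the handler turns that error into a missing value.
    slope <- tryCatch(
      lm(y ~ x, data = d)$coefficients[2],
      error = function(e) NA_real_
    )

    is.na(slope)
    ```

    Note that the handler catches every error raised inside tryCatch(), not only the "0 (non-NA) cases" one.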
    

    With this change to calculate_slope, your example code should run without error. Below is an alternative that produces the same result as your code with more concise syntax and depends only on dplyr.

    measures <- c("A", "B", "C")
    
    example_data %>%
      group_by(Subject, Response) %>%
      summarise(across(all_of(measures),
                       # Fit each measure against nDay; numeric NA if the fit fails
                       ~ tryCatch(lm(.x ~ nDay)$coefficients[2], error = \(e) NA_real_),
                       .names = "slope_{.col}"),
                .groups = "drop")
    
    # # A tibble: 6 × 5
    #   Subject Response  slope_A slope_B slope_C
    #     <int> <fct>       <dbl>   <dbl>   <dbl>
    # 1       1 x         0.195    0.318   -0.246
    # 2       1 y         0.00840  0.0513   0.105
    # 3       2 x        -0.108   -0.0261   0.321
    # 4       2 y        -0.347   -0.308    0.328
    # 5       3 x        -0.153   NA       -0.136
    # 6       3 y        -0.00175 NA       -0.146
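
    If you would rather not silence every possible error, one option is to test for missing data explicitly before fitting. The following is only a sketch: `slope_or_na` is a hypothetical helper (it is not part of dplyr or of the question's code), the threshold of two complete pairs is an assumption, and the data are regenerated with an arbitrary seed for illustration.

    ```r
    library(dplyr)

    # Hypothetical helper: fit y ~ x only when at least two complete (x, y)
    # pairs exist; otherwise return a numeric NA without calling lm().
    slope_or_na <- function(y, x) {
      ok <- !is.na(x) & !is.na(y)
      if (sum(ok) < 2) return(NA_real_)
      unname(coef(lm(y[ok] ~ x[ok]))[2])
    }

    set.seed(1)  # arbitrary seed, illustrative data only
    example_data <- expand.grid(Subject = 1:3,
                                Response = c("x", "y"),
                                nDay = 1:3) %>%
      mutate(A = runif(n()), B = runif(n()), C = runif(n()),
             B = ifelse(Subject == 3, NA, B))  # Subject 3 has no B values

    result <- example_data %>%
      group_by(Subject, Response) %>%
      summarise(across(all_of(c("A", "B", "C")),
                       ~ slope_or_na(.x, nDay),
                       .names = "slope_{.col}"),
                .groups = "drop")

    result
    ```

    Because the check runs inside summarise(), it is evaluated once per group, so the grouping is preserved. Unrelated problems, such as a misspelled column name, still surface as errors instead of being turned into NA.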