Search code examples
rdplyrnasummarizeacross

Summarizing data: Error in summarize() across() all_of() dynamic_list_of_vars => Selections can't have missing values


TL;DR: summarize(across(all_of(vars_to_sum), ~ sum(., na.rm=TRUE))) requires a non NA list of vars_to_sum, while I'm working inside a function and the user will sometimes ask for no variable to be summarized by a sum. The summarize() includes other across() and I'm not sure it is solvable by a simple if().

At least that what I think the problem is...


Hi!

I'm working on a project that involves quite a bit of data management/engineering, and there's one step that I struggle with. I'll replicate the problem on fake data.

I'm working on hospital stays. I started from relational databases, and I'm slowly summarizing the information in a format that fits my needs. Where my problem starts, the data will look something like this :

df <- data.frame(
patient_id = c("Anne","Bryan","Bryan","Charlotte","Charlotte","Denis","Denis","Denis"),
entry_date = as.POSIXt("2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-04-01", "2020-05-01", "2020-05-05", "2020-05-25"),
exit_date = as.POSIXt("2020-01-10", "2020-02-02", "2020-02-10", "2020-03-10", "2020-04-10", "2020-05-02", "2020-05-20", "2020-06-10"),
entry_mode = c("home","home","transfer","home","home","transfer",NA,"transfer"),
exit_mode = c("death","transfer","home","discharged","home","death",NA,"transfer"),
drug_A = c(0,0,1,1,0,1,0,0),
drug_B = c(0,1,1,0,0,0,1,1),
drug_C = c(1,2,5,1,0,0,1,0)
)

N.B: the database will have plenty more drug_xxx variables

My goal here is to develop a function that would

(1) identify patients that did several stays in a short period of time (started a hospital stay soon after the exit of another one, N.B: overlapping stays also exist), since we will consider that all such stays are related to the same medical issue (let's say cardiac arrest). If several hospitals/wards/doctors/... notify a cardiac arrest from the same person several days in a row, it's treated as 1 cardiac arrest, followed by several stays/consultations/... all related to the same cardiac arrest.

(2) summarizing the info of nearby stays by maxing the variables that should be maxed (defined by user), summing what should be summed, considering the earliest or latest datapoint when appropriate, and keeping the rest untouched

The function would consider the following arguments:

  • df, a dataframe
  • individual_id, a string indicating the name of the variable that identifies individuals
  • entry_date
  • exit_date
  • allowed_dist, the max distance between the exit date of a stay and the entry date of another one
  • vars_to_max, names of variables to be maxed
  • vars_to_sum, names of variables to be summed
  • vars_to_earliest, names of variables for which the earliest info is of interest
  • vars_to_latest, names of variables for which the latest info is of interest

Ideally, the function should work even if var_to_xxx arguments are NA, implying there's no variable to be sumrized this way on this occasion.

[EDIT: I initially shared my functional function, mixing dplyr and for-loops, but I think it just made the post less readable, so here goes a function that almost work and that needs fixing]

aggregate_stays <- function (df, individual_id, date_entry, date_exit, allowed_dist=0, vars_to_earliest=NA, vars_to_latest=NA, vars_to_sum=NA, vars_to_max=NA){
  
  require("lubridate",quietly=TRUE)
  require("dplyr",quietly=TRUE)
  
  # Distinguishing between single stays and multiple stays
  df_single_stays <- df %>%
    group_by(!!sym(individual_id)) %>%
    filter(n() == 1) %>%
    ungroup()
  df_multiple_stays <- df %>%
    group_by(!!sym(individual_id)) %>%
    filter(n() > 1) %>%
    ungroup()
  
  # Summarizing multiple stays according to the arguments of the function
  df_multiple_stays <- df_multiple_stays %>%
    arrange(!!sym(individual_id), !!sym(date_entry)) %>% # Ordering the individuals and stays
    group_by(!!sym(individual_id)) %>%
    mutate(wholestay_id = cumsum(!!sym(date_entry) - lag(!!sym(date_exit), default = first(!!sym(date_exit))) > allowed_dist)) %>% # Checking if nearby stays
    group_by(!!sym(individual_id), wholestay_id) %>%
    mutate(wholestay_id = cur_group_id()) %>% # Creating the index of "big stays"
    ungroup() %>%
    group_by(wholestay_id) %>%
    summarize(wholestay_entry = min(!!sym(date_entry)),
              wholestay_exit = max(!!sym(date_exit)),
              across(all_of(vars_to_earliest), ~ .[which.min(!!sym(date_entry))]),
              across(all_of(vars_to_latest), ~ .[which.max(!!sym(date_exit))]),
              across(all_of(vars_to_sum), ~ sum(., na.rm=TRUE)),
              across(all_of(vars_to_max), ~ max(., na.rm=TRUE))) %>%
    ungroup()
  
  
  # Removing the `wholestay_id` variable that was not asked by user
  df_multiple_stays <- df_multiple_stays %>%
    dplyr::select(-wholestay_id)
  
  df_final <- bind_rows(df_multiple_stays,df_single_stays)
  
  return(df_final)
}

But it doesn't seem to enjoy NA as argument for vars_to_xxx arguments

df2 <- aggregate_stays_wip(df = df,
                           individual_id = "patient_id",
                           date_entry = "entry_date",
                           date_exit = "exit_date",
                           allowed_dist = 10,
                           vars_to_sum = "drug_C",
                           vars_to_max = c("drug_A","drug_B"),
                           vars_to_earliest = "entry_mode",
                           vars_to_latest = c("patient_id","exit_mode"))

works fine

but

df3 <- aggregate_stays(df = df,
                       individual_id = "patient_id",
                       date_entry = "entry_date",
                       date_exit = "exit_date",
                       allowed_dist = 10,
                       vars_to_sum = NA,
                       vars_to_max = c("drug_A","drug_B","drug_C"),
                       vars_to_earliest = "entry_mode",
                       vars_to_latest = c("patient_id","exit_mode"))

returns

Error in `summarize()`:
ℹ In argument: `across(all_of(vars_to_sum), ~sum(., na.rm = TRUE))`.
Caused by error in `across()`:
! Selections can't have missing values.
Run `rlang::last_trace()` to see where the error occurred.

If you have any way to solve this, I'd love to hear it!

Best, an epidemiologist who whish he'd be better at coding.


Solution

  • Pass NULL instead and it then skips over that line of the summarise call.

    You can also perhaps make things easier by passing NULL as default and using tidy evaluation to avoid having to do quotes:

    library(tidyverse)
    
    aggregate_things <- function(
          data,
          vars_to_sum = NULL,
          vars_to_mean = NULL,
          vars_to_max = NULL
    ) {
      
      data |> 
        summarise(
          across({{vars_to_sum}}, sum),
          across({{vars_to_mean}}, mean),
          across({{vars_to_max}}, max),
        )
    }
    
    iris |> 
      aggregate_things(vars_to_sum = c(Sepal.Length, Sepal.Width),
                       vars_to_max = c(Petal.Length, Petal.Width))
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width
    #> 1        876.5       458.6          6.9         2.5
    

    Modifying your code to take NULLs

    library(tidyverse)
    
    df <- data.frame(
      patient_id = c("Anne","Bryan","Bryan","Charlotte","Charlotte","Denis","Denis","Denis"),
      entry_date = ymd("2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-04-01", "2020-05-01", "2020-05-05", "2020-05-25"),
      exit_date = ymd("2020-01-10", "2020-02-02", "2020-02-10", "2020-03-10", "2020-04-10", "2020-05-02", "2020-05-20", "2020-06-10"),
      entry_mode = c("home","home","transfer","home","home","transfer",NA,"transfer"),
      exit_mode = c("death","transfer","home","discharged","home","death",NA,"transfer"),
      drug_A = c("0","0","1","1","0","1","0","0"),
      drug_B = c("0","1","1","0","0","0","1","1"),
      drug_C = c("1","2","5","1","0","0","1","0")
    ) |> 
      mutate(across(starts_with("drug"), as.integer))
    
    
    aggregate_stays <-
      function (df,
                individual_id,
                date_entry,
                date_exit,
                allowed_dist = 0,
                vars_to_earliest = NULL,
                vars_to_latest = NULL,
                vars_to_sum = NULL,
                vars_to_max = NULL) {
        
        require("lubridate",quietly=TRUE)
        require("dplyr",quietly=TRUE)
    
      # Distinguishing between single stays and multiple stays
      df_single_stays <- df %>%
        group_by({{individual_id}}) %>%
        filter(n() == 1) %>%
        ungroup()
      
      df_multiple_stays <- df %>%
        group_by({{individual_id}}) %>%
        filter(n() > 1) %>%
        ungroup()
    
      # Summarizing multiple stays according to the arguments of the function
      df_multiple_stays <- df_multiple_stays %>%
        arrange({{individual_id}}, {{date_entry}}) %>% # Ordering the individuals and stays
        group_by({{individual_id}}) %>%
        mutate(wholestay_id = cumsum({{date_entry}} - lag({{date_exit}}, default = first({{date_exit}})) > allowed_dist)) %>% # Checking if nearby stays
        group_by({{individual_id}}, wholestay_id) %>%
        mutate(wholestay_id = cur_group_id()) %>% # Creating the index of "big stays"
        ungroup() %>%
        group_by(wholestay_id) %>%
        summarize(wholestay_entry = min({{date_entry}}),
                  wholestay_exit = max({{date_exit}}),
                  across({{vars_to_earliest}}, ~ .[which.min({{date_entry}})]),
                  across({{vars_to_latest}}, ~ .[which.max({{date_exit}})]),
                  across({{vars_to_sum}}, ~ sum(., na.rm=TRUE)),
                  across({{vars_to_max}}, ~ max(., na.rm=TRUE))) %>%
        ungroup()
      
      
      # Removing the `wholestay_id` variable that was not asked by user
      df_multiple_stays <- df_multiple_stays %>%
        dplyr::select(-wholestay_id)
      
      df_final <- bind_rows(df_multiple_stays,df_single_stays)
      
      return(df_final)
    }
    
    aggregate_stays(df = df,
                        individual_id = patient_id,
                        date_entry = entry_date,
                        date_exit = exit_date,
                        allowed_dist = 10,
                        vars_to_sum = drug_C,
                        vars_to_max = c(drug_A,drug_B),
                        vars_to_earliest = entry_mode,
                        vars_to_latest = c(patient_id,exit_mode))
    #> # A tibble: 5 × 10
    #>   wholestay_entry wholestay_exit entry_mode patient_id exit_mode  drug_C drug_A
    #>   <date>          <date>         <chr>      <chr>      <chr>       <int>  <int>
    #> 1 2020-02-01      2020-02-10     home       Bryan      home            7      1
    #> 2 2020-03-01      2020-03-10     home       Charlotte  discharged      1      1
    #> 3 2020-04-01      2020-04-10     home       Charlotte  home            0      0
    #> 4 2020-05-01      2020-06-10     transfer   Denis      transfer        1      1
    #> 5 NA              NA             home       Anne       death           1      0
    #> # ℹ 3 more variables: drug_B <int>, entry_date <date>, exit_date <date>
    
    aggregate_stays(df = df,
                    individual_id = patient_id,
                    date_entry = entry_date,
                    date_exit = exit_date,
                    allowed_dist = 10,
                    vars_to_max = c(drug_A,drug_B,drug_C),
                    vars_to_earliest = entry_mode,
                    vars_to_latest = c(patient_id,exit_mode))
    #> # A tibble: 5 × 10
    #>   wholestay_entry wholestay_exit entry_mode patient_id exit_mode  drug_A drug_B
    #>   <date>          <date>         <chr>      <chr>      <chr>       <int>  <int>
    #> 1 2020-02-01      2020-02-10     home       Bryan      home            1      1
    #> 2 2020-03-01      2020-03-10     home       Charlotte  discharged      1      0
    #> 3 2020-04-01      2020-04-10     home       Charlotte  home            0      0
    #> 4 2020-05-01      2020-06-10     transfer   Denis      transfer        1      1
    #> 5 NA              NA             home       Anne       death           0      0
    #> # ℹ 3 more variables: drug_C <int>, entry_date <date>, exit_date <date>
    

    This should now take your data and give the intended result, with the added advantage of if you pipe your dataframe into the function in RStudio then tab autocomplete will look up the columns in your dataframe for you and pass them as unquoted column names.