Search code examples
rdplyrcaserowmutate

Using case_when with dplyr mutate: why is sometimes the value per row used for calculation and sometimes the whole column?


I want to create a new variable when two variables match a condition. In that case I want to have the smaller of two other variables. In my real data these are dates and I want to add other conditions in the case_when function, but I keep it simple her.

When I use first(sort(c(var1,var2))) the variables are interpreted as the whole column and it results in the smallest value for var1, var2 for the entire data.frame. But I want the smallest of the two variables per row in the data. In contrast, when I use var1 * var2 the variables per row in the data are used.

I tried the following:

library(dplyr)
    starwars %>% 
      mutate(
        new_var = 
          case_when(
           hair_color == "brown" & eye_color == "brown" ~ first(sort(c(birth_year, mass)))
          ),
        new_var2 = 
          case_when(
            hair_color == "brown" & eye_color == "brown" ~ birth_year * mass 
          )
       ) %>% 
      select ( name, hair_color, eye_color, mass, birth_year, new_var, new_var2) %>% 
      filter(hair_color == "brown" & eye_color == "brown")

shortend output:

 # A tibble: 9 × 7
  name                  hair_color eye_color  mass birth_year new_var new_var2
  <chr>                 <chr>      <chr>     <dbl>      <dbl>   <dbl>    <dbl>
1 Leia Organa           brown      brown        49         19       8      931
2 Han Solo              brown      brown        80         29       8     2320
3 Arvel Crynyd          brown      brown        NA         NA       8       NA
4 Wicket Systri Warrick brown      brown        20          8       8      160

Thanks a lot in advance


Solution

  • For functions that summarise information, like first, using the function without grouping will result in a calculation over the entire dataframe. The most direct remedy is to use rowwise with c_across.

    But doing this results in slower code, so I'd suggest using pmin:

    library(dplyr)
    library(purrr)
    starwars %>% 
      mutate(new_var = 
               case_when(hair_color == "brown" & eye_color == "brown" ~ 
                           pmin(birth_year, mass))) %>% 
      select (name, hair_color, eye_color, mass, birth_year, new_var)
    

    output

    # A tibble: 87 × 6
       name               hair_color    eye_color  mass birth_year new_var
       <chr>              <chr>         <chr>     <dbl>      <dbl>   <dbl>
     1 Luke Skywalker     blond         blue         77       19        NA
     2 C-3PO              NA            yellow       75      112        NA
     3 R2-D2              NA            red          32       33        NA
     4 Darth Vader        none          yellow      136       41.9      NA
     5 Leia Organa        brown         brown        49       19        19
     6 Owen Lars          brown, grey   blue        120       52        NA
     7 Beru Whitesun lars brown         blue         75       47        NA
     8 R5-D4              NA            red          32       NA        NA
     9 Biggs Darklighter  black         brown        84       24        NA
    10 Obi-Wan Kenobi     auburn, white blue-gray    77       57        NA
    

    With c_across + rowwise. This is not an optimal solution. You could use pmin to get the minimum value per row instead (see above).

    starwars %>% 
      rowwise() %>% 
      mutate(new_var = 
               case_when(hair_color == "brown" & eye_color == "brown" ~ 
                           first(sort(c_across(c(birth_year, mass)))))) %>% 
      select (name, hair_color, eye_color, mass, birth_year, new_var)
    

    If you need the calculations only on those with the conditions, you can be even simpler and avoid case_when, using pmin:

    starwars %>% 
      filter(hair_color == "brown" & eye_color == "brown") %>% 
      mutate(new_var = pmin(birth_year, mass)))