How does dplyr::mutate apply changes to multiple columns within the same mutate command?


I have noticed that mutate gives a different result when I change multiple columns within a single mutate command than when I chain several mutate commands with pipes. I assumed that within one mutate command the changes are made one after the other, so that later expressions see the results of earlier ones. Is this assumption wrong?

In the following example, a sum score is calculated from the columns x_1, x_2 and x_3. The column y_1 only illustrates that the real data contain other columns, which is why the commands cannot simply be applied to all columns.

Example Data:

library(tidyverse)

df <- tibble(
  x_1 = c(2, 3, 4, 5, 4, 3),
  x_2 = c(NA, 4, 2, 3, 4, 1),
  x_3 = c(NA, 2, 3, 1, 3, NA),
  y_1 = c(3, 4, 5, 2, 3, 2)
)

When calculating the score, a single missing value can be replaced by the row mean of the other items. If more than one item is missing, the score cannot be calculated. To implement this: (1) The variable x_miscount is created, which counts the number of missing values per row. (2) The row mean (x_rowmean) over the x columns is calculated. (3) Missing values in each x column are replaced with the row mean if the missing count is exactly 1. (4) The sum score is calculated.

df %>% 
  mutate(x_miscount = rowSums(is.na(select(., x_1:x_3))),
         x_rowmean = rowMeans(select(., x_1:x_3), na.rm = T),
         across(.cols = c(x_1:x_3),
                .fns = ~if_else(is.na(.) & 
                                  x_miscount == 1,
                                x_rowmean, 
                                .)),
         x_score = rowSums(select(., x_1:x_3)))

However, row 6 of the result shows that x_score is NA, even though the missing value in column x_3 was correctly replaced by the row mean.

If the score is calculated in a new mutate command everything works as intended:

df %>% 
  mutate(x_miscount = rowSums(is.na(select(., x_1:x_3))),
         x_rowmean = rowMeans(select(., x_1:x_3), na.rm = T),
         across(.cols = c(x_1:x_3),
                .fns = ~if_else(is.na(.) & 
                                  x_miscount == 1,
                                x_rowmean, 
                                .))) %>% 
  mutate(x_score = rowSums(select(., x_1:x_3)))

Does anybody know how dplyr::mutate makes changes if multiple columns should be changed within the same command? I don't get the difference between the two examples and I'm confused by the different results.


Solution

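  • I believe the problem here is the use of select(., x_1:x_3), not the way mutate updates columns: mutate does evaluate its expressions in order, and later expressions can see columns changed earlier in the same call. The . placeholder, however, refers to the object on the left-hand side of %>%, which in this case is df as it was before mutate was applied, so the x_score expression sums the original columns, in which x_3 is still NA in row 6.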
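
    A minimal sketch of that difference, using a made-up one-column tibble (the names demo, a, b and c are purely illustrative and not part of the question): a bare column name is looked up in the data as mutate currently sees it, whereas . still points at the tibble that entered the pipe.

    ```r
    library(dplyr)

    demo <- tibble(a = 1:3) %>%
      mutate(
        a = a * 10,      # a is overwritten first
        b = a,           # bare name: sees the updated a
        c = pull(., a)   # `.`: the tibble as it was before mutate()
      )

    demo$b  # 10 20 30
    demo$c  # 1 2 3
    ```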

    You'll notice that this code:

    library(tidyverse)
    
    df %>% 
      mutate(
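        # `.` is the original df; harmless here because x_1:x_3 are not modified yet
        x_miscount = rowSums(is.na(select(., x_1:x_3))),
        x_rowmean = rowMeans(select(., x_1:x_3), na.rm = TRUE),
        across(
          .cols = c(x_1:x_3),
          .fns = ~ if_else(is.na(.) & x_miscount == 1, x_rowmean, .)
        ),
        # bare column names refer to the columns as updated by across() above
        x_score = x_1 + x_2 + x_3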
      )
    

    produces the expected result, since it doesn't refer back to df with (.) when creating x_score.
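
    If you would rather keep a rowSums() call, one option is to select the columns with pick() instead of select(., ...). This is only a sketch and assumes dplyr 1.1.0 or later, where pick() is available. pick() reads from the data as mutate currently sees it, so it picks up the imputed values:

    ```r
    library(dplyr)

    df <- tibble(
      x_1 = c(2, 3, 4, 5, 4, 3),
      x_2 = c(NA, 4, 2, 3, 4, 1),
      x_3 = c(NA, 2, 3, 1, 3, NA),
      y_1 = c(3, 4, 5, 2, 3, 2)
    )

    res <- df %>%
      mutate(
        x_miscount = rowSums(is.na(pick(x_1:x_3))),
        x_rowmean = rowMeans(pick(x_1:x_3), na.rm = TRUE),
        across(
          x_1:x_3,
          ~ if_else(is.na(.x) & x_miscount == 1, x_rowmean, .x)
        ),
        # pick() sees x_1:x_3 after the across() step above
        x_score = rowSums(pick(x_1:x_3))
      )

    res$x_score[6]  # 6, i.e. 3 + 1 + 2 with the imputed x_3
    ```

    Row 1 still gets an NA score, as intended, because two of its items are missing.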

    For a more general solution for arbitrarily many x_ variables, I'd suggest pivot_longer():

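    # Needs R >= 4.1 for |> and dplyr >= 1.1.0 for .by
    df |> 
      mutate(id = row_number()) |>         # row identifier to group by
      pivot_longer(starts_with("x_")) |>   # one row per id and x item (columns: name, value)
      mutate(
        x_miscount = sum(is.na(value)), 
        x_mean = mean(value, na.rm = TRUE),
        value = if_else(is.na(value) & x_miscount == 1, x_mean, value),
        x_score = sum(value),              # uses the imputed value from the line above
        .by = id
      ) |> 
      pivot_wider() |>                     # back to one row per id (defaults: name, value)
      select(!id) |> 
      relocate(x_1:x_3, .before = y_1)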