Search code examples
rdplyracross

Correct syntax for mutate across when excluding columns, part 2


I thought I had the answer to my question here, but when I used with my larger data set I get different results. I suspect the difference is because of the way the na.locf line is acting.

Basically I am converting code where I used mutate_at to the new syntax with mutate(across()).

In the first case below, the data is filled correctly, because df_initial is still grouped by the index_name. In the second case, I'm assuming because I had to ungroup for the mutate across to work, I get a different answer.

So here is another example with a larger data set to illustrate the problem.

Reproducible example:

df_initial <- 
structure(list(Date = structure(c(18681, 18681, 18681, 18681, 
                                  18682, 18682, 18682, 18682, 18683, 18683, 18683, 18683, 18684, 
                                  18684, 18684, 18684, 18685, 18685, 18685, 18685, 18686, 18686, 
                                  18686, 18686), class = "Date"), index_name = c("INDU Index", 
                                                                                 "SPX Index", "TPX Index", "MEXBOL Index", "INDU Index", "SPX Index", 
                                                                                 "TPX Index", "MEXBOL Index", "INDU Index", "SPX Index", "TPX Index", 
                                                                                 "MEXBOL Index", "INDU Index", "SPX Index", "TPX Index", "MEXBOL Index", 
                                                                                 "INDU Index", "SPX Index", "TPX Index", "MEXBOL Index", "INDU Index", 
                                                                                 "SPX Index", "TPX Index", "MEXBOL Index"), index_level = c(31537.35, 
                                                                                                                                            3881.37, NA, 45268.33, 31961.86, 3925.43, 1903.07, 45151.38, 
                                                                                                                                            31402.01, 3829.34, 1926.23, 44310.27, 30932.37, 3811.15, 1864.49, 
                                                                                                                                            44592.91, NA, NA, NA, NA, NA, NA, NA, NA), totalReturn_daily = c(0.0497, 
                                                                                                                                                                                                             0.1277, 0, 0.7158, 1.3461, 1.1364, -1.8201, -0.1151, -1.7181, 
                                                                                                                                                                                                             -2.4339, 1.2411, -1.8629, -1.4628, -0.4636, -3.2052, 0.6379, 
                                                                                                                                                                                                             0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, -24L), groups = structure(list(
                                                                                                                                                                                                               index_name = c("INDU Index", "MEXBOL Index", "SPX Index", 
                                                                                                                                                                                                                              "TPX Index"), .rows = structure(list(c(1L, 5L, 9L, 13L, 17L, 
                                                                                                                                                                                                                                                                     21L), c(4L, 8L, 12L, 16L, 20L, 24L), c(2L, 6L, 10L, 14L, 
                                                                                                                                                                                                                                                                                                            18L, 22L), c(3L, 7L, 11L, 15L, 19L, 23L)), ptype = integer(0), class = c("vctrs_list_of", 
                                                                                                                                                                                                                                                                                                                                                                                     "vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                "tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               "tbl_df", "tbl", "data.frame"))

The first approach below gives the correct values, but the second approach below does not. So I am trying to get the same answer in approach #2 and I get in approach #1.

# Approach 1: Expected output received here:
df_initial %>%
  mutate_at(vars(-index_name, -totalReturn_daily),
            ~ na.locf(., na.rm = FALSE)) %>%
  filter(index_name == "TPX Index")

# Output
  Date       index_name index_level totalReturn_daily
  <date>     <chr>            <dbl>             <dbl>
1 2021-02-23 TPX Index          NA               0   
2 2021-02-24 TPX Index        1903.             -1.82
3 2021-02-25 TPX Index        1926.              1.24
4 2021-02-26 TPX Index        1864.             -3.21
5 2021-02-27 TPX Index        1864.              0   
6 2021-02-28 TPX Index        1864.              0  

# Approach 2: Did not receive expected output here
df_initial %>%
  ungroup() %>%
  mutate(across(
    .cols = -c(index_name, totalReturn_daily),
    .fns  = ~ na.locf(., na.rm = FALSE)
  )) %>%
  filter(index_name == "TPX Index")

# Output
  Date       index_name index_level totalReturn_daily
  <date>     <chr>            <dbl>             <dbl>
1 2021-02-23 TPX Index        3881.              0   
2 2021-02-24 TPX Index        1903.             -1.82
3 2021-02-25 TPX Index        1926.              1.24
4 2021-02-26 TPX Index        1864.             -3.21
5 2021-02-27 TPX Index       44593.              0   
6 2021-02-28 TPX Index       44593.              0  

Thanks!


Solution

  • Both approaches gave similar results for me. Could you try the code below?

    library(zoo)
    df_initial %>%
      group_by(index_name) %>% 
      mutate_at(vars(-index_name, -totalReturn_daily),
                ~ na.locf(., na.rm = FALSE)) %>% 
      dplyr::filter(index_name == "TPX Index") 
    
    
    df_initial %>%
      group_by(index_name) %>% 
      mutate(across(
        .cols = -c(totalReturn_daily),
        .fns  = ~ na.locf(., na.rm = FALSE)
      )) %>%
      ungroup() %>% 
      dplyr::filter(index_name == "TPX Index")