
Why does a custom function using dplyr give a different result from the same code run outside the function?


So I am writing a function to create a specific number of duplicate rows, going from something like this:

df1 <- tibble(
  Random_category = c(rep("A", 2), rep("B", 3), rep("C", 6)),
  ID = 1:11,
  Value = sample(1:100, 11, replace = TRUE)
)

   Random_category    ID Value
   <chr>           <int> <int>
 1 A                   1    92
 2 A                   2    11
 3 B                   3    42
 4 B                   4    33
 5 B                   5    93
 6 C                   6    79
 7 C                   7    82
 8 C                   8    46
 9 C                   9    77
10 C                  10    88
11 C                  11    58

To something like this:


Random_category    ID Value
<chr>           <int> <int>
 1 A                   2    60
 2 A                   2    60
 3 A                   1     8
 4 A                   2    60
 5 A                   1     8
 6 B                   3    31
 7 B                   4    13
 8 B                   4    13
 9 B                   5    91
10 B                   5    91
11 C                   6    19
12 C                   9    72
13 C                   7    26
14 C                  10    85
15 C                   8    67

My function looks like this:

duplicate_rows <- function(df, target_num_of_rows, group_name) {
  df %>%
    group_by({{group_name}}) %>%
    mutate(rows_to_duplicate = if_else(row_number() <= target_num_of_rows, ceiling(target_num_of_rows / n()), 0)) %>%
    slice(rep(row_number(), times = rows_to_duplicate)) %>%
    ungroup() %>%
    select(-rows_to_duplicate) %>%
    slice_sample(by = {{group_name}}, n = target_num_of_rows)
}

# Duplicate rows ensuring each group has exactly 5 rows
df_duplicated <- duplicate_rows(df1, 5, "Random_category")

But instead it gives me:

Random_category    ID Value `"Random_category"`
<chr>           <int> <int> <chr>
1 A                   2    60 Random_category
2 A                   1     8 Random_category
3 B                   3    31 Random_category
4 B                   4    13 Random_category
5 B                   5    91 Random_category

Yet when I take the dplyr pipeline out of the function, it works perfectly:

df1 %>%
  group_by(Random_category) %>%
  mutate(rows_to_duplicate = if_else(row_number() <= 5, ceiling(5 / n()), 0)) %>%
  slice(rep(row_number(), times = rows_to_duplicate)) %>%
  ungroup() %>%
  select(-rows_to_duplicate) %>%
  slice_sample(by = Random_category, n = 5)

I suspect it has something to do with the group name, but I don't understand why.


Solution

  • Pass the column as an unquoted name (bare or in backticks) instead of a quoted string.

    duplicate_rows(df1, 5, `Random_category`)
    # # A tibble: 15 × 3
    #    Random_category    ID Value
    #    <chr>           <int> <int>
    #  1 A                   2    11
    #  2 A                   1    92
    #  3 A                   1    92
    #  4 A                   2    11
    #  5 A                   1    92
    #  6 B                   4    33
    #  7 B                   5    93
    #  8 B                   3    42
    #  9 B                   4    33
    # 10 B                   5    93
    # 11 C                   8    46
    # 12 C                   9    77
    # 13 C                   7    82
    # 14 C                  10    88
    # 15 C                   6    79
    

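    {{ }} works on symbols (bare column names), not strings, so we need to pass it something compatible. Given a string, group_by() treats it as a constant expression: it adds a new column named `"Random_category"` holding the same value in every row, so the whole data frame becomes a single group and slice_sample() returns only 5 rows in total.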
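
    To see the difference in isolation, here is a minimal sketch on a made-up three-row tibble (the names toy, g and x are illustrative and not from the question). It compares grouping by a string with grouping by a bare column name:

    ```r
    library(dplyr)

    # Hypothetical toy data: two groups in column g
    toy <- tibble(g = c("a", "a", "b"), x = 1:3)

    # A string is just a constant expression, so every row lands in one group
    by_string <- toy %>% group_by("g")

    # A bare name refers to the existing column
    by_symbol <- toy %>% group_by(g)

    n_groups(by_string)   # 1
    n_groups(by_symbol)   # 2
    group_vars(by_string) # the new constant column, not the original g
    ```

    This is the same effect as the extra `"Random_category"` column in the question's output.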

    FYI, if you want it to accept strings instead, convert the string to a symbol first:

    duplicate_rows <- function(df, target_num_of_rows, group_name) {
      group_name <- sym(group_name)
      df %>%
        group_by({{group_name}}) %>%
        mutate(rows_to_duplicate = if_else(row_number() <= target_num_of_rows, ceiling(target_num_of_rows / n()), 0)) %>%
        slice(rep(row_number(), times = rows_to_duplicate)) %>%
        ungroup() %>%
        select(-rows_to_duplicate) %>%
        slice_sample(by = {{group_name}}, n = target_num_of_rows)
    }
    duplicate_rows(df1, 5, "Random_category")
    # # A tibble: 15 × 3
    #    Random_category    ID Value
    #    <chr>           <int> <int>
    #  1 A                   1    92
    #  2 A                   1    92
    #  3 A                   2    11
    #  4 A                   1    92
    #  5 A                   2    11
    #  6 B                   5    93
    #  7 B                   4    33
    #  8 B                   4    33
    #  9 B                   5    93
    # 10 B                   3    42
    # 11 C                   9    77
    # 12 C                   8    46
    # 13 C                   7    82
    # 14 C                  10    88
    # 15 C                   6    79
    

    ... but now passing a symbol no longer works.

    duplicate_rows(df1, 5, `Random_category`)
    # Error in datamart_write(copy(allouts)[, `:=`(MyCar, paste0("c", MyCar))],  : 
    #   object 'Random_category' not found
    

    Choose whichever strategy makes the most sense to you.
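
    If you settle on strings, another option is to skip the symbol conversion and use tidyselect's all_of(). The following is only a sketch under assumptions: the function name duplicate_rows_chr and the toy data are made up for illustration, and it relies on across() inside group_by() and on the by argument of slice_sample() (dplyr 1.1.0 or later, which the question already uses):

    ```r
    library(dplyr)

    # Hypothetical string-only variant of the question's function
    duplicate_rows_chr <- function(df, target_num_of_rows, group_name) {
      df %>%
        # all_of() selects the column whose name is stored in group_name
        group_by(across(all_of(group_name))) %>%
        mutate(rows_to_duplicate = if_else(row_number() <= target_num_of_rows,
                                           ceiling(target_num_of_rows / n()), 0)) %>%
        slice(rep(row_number(), times = rows_to_duplicate)) %>%
        ungroup() %>%
        select(-rows_to_duplicate) %>%
        slice_sample(by = all_of(group_name), n = target_num_of_rows)
    }

    # Illustrative data with the same group sizes as df1 (2, 3 and 6 rows)
    toy <- tibble(grp = c(rep("A", 2), rep("B", 3), rep("C", 6)), id = 1:11)

    res <- duplicate_rows_chr(toy, 5, "grp")
    count(res, grp)  # each group should have 5 rows
    ```

    Because the column name is an ordinary string here, it can also come from a variable or a vector of names.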


    Edit: @Onyambu suggested a way that handles both:

    duplicate_rows <- function(df, target_num_of_rows, group_name) {
      group_name <- as.name(as.character(substitute(group_name)))
      df %>%
        group_by({{group_name}}) %>%
        mutate(rows_to_duplicate = if_else(row_number() <= target_num_of_rows, ceiling(target_num_of_rows / n()), 0)) %>%
        slice(rep(row_number(), times = rows_to_duplicate)) %>%
        ungroup() %>%
        select(-rows_to_duplicate) %>%
        slice_sample(by = {{group_name}}, n = target_num_of_rows)
    }
    duplicate_rows(df1, 5, "Random_category") # works
    duplicate_rows(df1, 5, `Random_category`) # works
    

    I like the fact that it is flexible, though I do believe that sometimes polymorphism can go too far. Not sure if this is one of those times ...
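
    One concrete cost of that flexibility: substitute() captures whatever was typed at the call site, so a variable that holds the column name is read as the name itself. A minimal base-R sketch of just the capture step (capture_name is a hypothetical helper, not part of the function above):

    ```r
    # Hypothetical helper that repeats only the capture step
    capture_name <- function(group_name) {
      as.character(substitute(group_name))
    }

    capture_name("Random_category")  # "Random_category"
    capture_name(Random_category)    # "Random_category"

    # A variable holding the column name is NOT looked up
    col <- "Random_category"
    capture_name(col)                # "col"
    ```

    So a call like duplicate_rows(df1, 5, col) would look for a column called col rather than Random_category.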