I am writing a function that creates a specific number of duplicate rows per group, going from something like this:
library(dplyr)
df1 <- tibble(
  Random_category = c(rep("A", 2), rep("B", 3), rep("C", 6)),
  ID = 1:11,
  Value = sample(1:100, 11, replace = TRUE)
)
Random_category ID Value
<chr> <int> <int>
1 A 1 92
2 A 2 11
3 B 3 42
4 B 4 33
5 B 5 93
6 C 6 79
7 C 7 82
8 C 8 46
9 C 9 77
10 C 10 88
11 C 11 58
To something like this:
Random_category ID Value
<chr> <int> <int>
1 A 2 60
2 A 2 60
3 A 1 8
4 A 2 60
5 A 1 8
6 B 3 31
7 B 4 13
8 B 4 13
9 B 5 91
10 B 5 91
11 C 6 19
12 C 9 72
13 C 7 26
14 C 10 85
15 C 8 67
My function looks like this:
duplicate_rows <- function(df, target_num_of_rows, group_name) {
  df %>%
    group_by({{group_name}}) %>%
    mutate(rows_to_duplicate = if_else(row_number() <= target_num_of_rows, ceiling(target_num_of_rows / n()), 0)) %>%
    slice(rep(row_number(), times = rows_to_duplicate)) %>%
    ungroup() %>%
    select(-rows_to_duplicate) %>%
    slice_sample(by = {{group_name}}, n = target_num_of_rows)
}
# Duplicate rows ensuring each group has exactly 5 rows
df_duplicated <- duplicate_rows(df1, 5, "Random_category")
But instead it gives me:
Random_category ID Value `"Random_category"`
<chr> <int> <int> <chr>
1 A 2 60 Random_category
2 A 1 8 Random_category
3 B 3 31 Random_category
4 B 4 13 Random_category
5 B 5 91 Random_category
Yet when I take the dplyr pipeline out of the function and run it directly, it works perfectly:
df1 %>%
  group_by(Random_category) %>%
  mutate(rows_to_duplicate = if_else(row_number() <= 5, ceiling(5 / n()), 0)) %>%
  slice(rep(row_number(), times = rows_to_duplicate)) %>%
  ungroup() %>%
  select(-rows_to_duplicate) %>%
  slice_sample(by = Random_category, n = 5)
I suspect it has something to do with the group name, but I don't understand why.
Pass the column as a bare name (backticks are fine but optional for a syntactic name like this one) instead of a quoted string.
duplicate_rows(df1, 5, `Random_category`)
# # A tibble: 15 × 3
# Random_category ID Value
# <chr> <int> <int>
# 1 A 2 11
# 2 A 1 92
# 3 A 1 92
# 4 A 2 11
# 5 A 1 92
# 6 B 4 33
# 7 B 5 93
# 8 B 3 42
# 9 B 4 33
# 10 B 5 93
# 11 C 8 46
# 12 C 9 77
# 13 C 7 82
# 14 C 10 88
# 15 C 6 79
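{{ }} works on symbols (bare column names), not strings, so we need to pass it something compatible. Given the string, group_by() treats "Random_category" as a constant expression and adds a new column with that label, which is the extra column in your output.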
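To see the difference in isolation, here is a minimal sketch on a small made-up tibble (the names `df`, `g`, and `x` are illustrative only and not from the question). It compares the grouping that results from a string with the grouping that results from a bare name.

```r
library(dplyr)

# Toy data, invented for this illustration
df <- tibble(g = c("a", "a", "b"), x = 1:3)

# A string is a constant expression: group_by() adds a new constant column
# and groups on that, so every row ends up in the same single group
by_string <- df %>% group_by("g")

# A bare name refers to the existing column, giving one group per value of g
by_symbol <- df %>% group_by(g)

n_groups(by_string)
n_groups(by_symbol)
group_vars(by_string)
group_vars(by_symbol)
```

The string version reports one group and a grouping variable whose name includes the quote characters, which matches the stray column in the question's output.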
FYI, if you want it to accept strings instead, convert the string to a symbol with sym() first:
duplicate_rows <- function(df, target_num_of_rows, group_name) {
  group_name <- sym(group_name)
  df %>%
    group_by({{group_name}}) %>%
    mutate(rows_to_duplicate = if_else(row_number() <= target_num_of_rows, ceiling(target_num_of_rows / n()), 0)) %>%
    slice(rep(row_number(), times = rows_to_duplicate)) %>%
    ungroup() %>%
    select(-rows_to_duplicate) %>%
    slice_sample(by = {{group_name}}, n = target_num_of_rows)
}
duplicate_rows(df1, 5, "Random_category")
# # A tibble: 15 × 3
# Random_category ID Value
# <chr> <int> <int>
# 1 A 1 92
# 2 A 1 92
# 3 A 2 11
# 4 A 1 92
# 5 A 2 11
# 6 B 5 93
# 7 B 4 33
# 8 B 4 33
# 9 B 5 93
# 10 B 3 42
# 11 C 9 77
# 12 C 8 46
# 13 C 7 82
# 14 C 10 88
# 15 C 6 79
... but now the use of symbols does not work.
duplicate_rows(df1, 5, `Random_category`)
# Error in datamart_write(copy(allouts)[, `:=`(MyCar, paste0("c", MyCar))], :
# object 'Random_category' not found
Choose whichever strategy makes the most sense to you.
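If you settle on strings, another option is to skip the symbol conversion and select the column by name with across(all_of()) in group_by() and all_of() in `by`. The sketch below is a variant of the function above, not something from the question: the name `duplicate_rows_chr` is made up here, and it assumes dplyr 1.1.0 or later (needed for `by` anyway).

```r
library(dplyr)

# Same example data as in the question
df1 <- tibble(
  Random_category = c(rep("A", 2), rep("B", 3), rep("C", 6)),
  ID = 1:11,
  Value = sample(1:100, 11, replace = TRUE)
)

# Hypothetical string-only variant: group_name must be a character scalar
duplicate_rows_chr <- function(df, target_num_of_rows, group_name) {
  df %>%
    # all_of() looks the column up by its name stored in group_name
    group_by(across(all_of(group_name))) %>%
    mutate(rows_to_duplicate = if_else(row_number() <= target_num_of_rows,
                                       ceiling(target_num_of_rows / n()), 0)) %>%
    slice(rep(row_number(), times = rows_to_duplicate)) %>%
    ungroup() %>%
    select(-rows_to_duplicate) %>%
    # `by` uses tidy selection, so all_of() works here as well
    slice_sample(by = all_of(group_name), n = target_num_of_rows)
}

res <- duplicate_rows_chr(df1, 5, "Random_category")
count(res, Random_category)
```

Because all_of() errors on a name that is not a column, a misspelled string fails loudly instead of silently creating a new grouping column.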
Edit: @Onyambu suggested a way that handles both:
duplicate_rows <- function(df, target_num_of_rows, group_name) {
  group_name <- as.name(as.character(substitute(group_name)))
  df %>%
    group_by({{group_name}}) %>%
    mutate(rows_to_duplicate = if_else(row_number() <= target_num_of_rows, ceiling(target_num_of_rows / n()), 0)) %>%
    slice(rep(row_number(), times = rows_to_duplicate)) %>%
    ungroup() %>%
    select(-rows_to_duplicate) %>%
    slice_sample(by = {{group_name}}, n = target_num_of_rows)
}
duplicate_rows(df1, 5, "Random_category") # works
duplicate_rows(df1, 5, `Random_category`) # works
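The reason both calls work is that substitute() captures the argument unevaluated, and as.character() turns either a string or a symbol into the same text. A small sketch of just that step, using a helper named `to_name` that is made up for this illustration:

```r
# Hypothetical helper isolating the normalisation line from the function above
to_name <- function(x) as.name(as.character(substitute(x)))

from_string <- to_name("Random_category")
from_symbol <- to_name(Random_category)

# Both inputs end up as the same symbol
identical(from_string, from_symbol)

# Caveat: a variable holding the column name is captured, not evaluated,
# so this yields the symbol `col` rather than `Random_category`
col <- "Random_category"
to_name(col)
```

That caveat matters if the function is ever called with the column name stored in a variable, for example inside a loop or another wrapper.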
I like the fact that it is flexible, though I do believe that sometimes polymorphism can go too far. Not sure if this is one of those times ...