writing group_by friendly functions

I'm building a function to add a validation coulomb to my data that I plan to model later. (tidymodel::initial.split() won't let me stratify deeply enough hence my diy approach)

the function is simple it will add a column (when used throughout mutate) that reads either "validation" or "testing" and will randomly apply these at a 0.25/075 ratio.

I have made the function fully work but it isn't perfect. the only way i can get it to see the group_by is to place a n() in my execution code

#edit
set.seed(100)

#exsample data
df<-tibble(x = seq(1,15,by=1)%>%rep(100),
           true = (seq(1,100,by=1)*100*runif(1))%>%rep(each = 15),  
           measurment = lapply(rep(1,100),
                               function(data) 
                               {data*rnorm(15, mean = 100*runif(1), sd = 2)}
           )%>%unlist(),
           z = c("a","b","c","d","e")%>%rep(20, each = 15),
           ID = paste0("a", 1:20)%>%rep(each = 15*5)
)



validation_column <- function(data) {
  # Generate a vector of labels based on the desired ratio
  labels <- c(rep("training", round(0.75 * data)), rep("validation", round(0.25 * data)))
  # Shuffle the labels
  labels <- sample(labels)
  return(labels)
}

df %>%
  group_by(ID,z) %>%
  mutate(validation = validation_column(n()))

I believe I need to incorporate some kind of if{} into my code to trigger a do() when I build the function but i cant figure out how exactly. this was my attempt based on https://www.r-bloggers.com/2018/07/writing-pipe-friendly-functions/ but it still doesn't work

any advice would be welcome so i can build beter functions in the future.

validation_column <- function(data) {
  if (dplyr::is_grouped_df(data)) {
    return(dplyr::do(data, ~ validation_column(.)))
  }
  
  # Generate a vector of labels based on the desired ratio
  labels <- c(rep("training", round(0.75 * n(data))), rep("validation", round(0.25 * n(data))))
  # Shuffle the labels
  labels <- sample(labels)
  return(labels)
}


df %>%
  group_by(ID,z) %>%
  mutate(validation = validation_column())

Solution

Your rep(..round(data)) needs a numeric input, a length.

Try this:

validation_column <- function(data) {
  
  # Generate a vector of labels based on the desired ratio
  labels <- c(
    rep("training",   round(0.75 * length(data))), # length(data) 
    rep("validation", round(0.25 * length(data)))) # length(data)
  
  # Shuffle the labels
  labels <- sample(labels)
  
  return(labels)
}

Pass a column as mutate input:

> mutate(df, validation = validation_column(z))
# A tibble: 1,500 × 6
       x  true measurment z     ID    validation
   <dbl> <dbl>      <dbl> <chr> <chr> <chr>     
 1     1  30.8       26.0 a     a1    training  
 2     2  30.8       25.6 a     a1    training  
 3     3  30.8       27.5 a     a1    training  
 4     4  30.8       26.0 a     a1    training  
 5     5  30.8       26.4 a     a1    validation
 6     6  30.8       24.6 a     a1    validation
 7     7  30.8       27.2 a     a1    validation
 8     8  30.8       24.1 a     a1    training  
 9     9  30.8       25.0 a     a1    training  
10    10  30.8       25.9 a     a1    training  
# ℹ 1,490 more rows
# ℹ Use `print(n = ...)` to see more rows

Edit. Sample data with set.seed:

set.seed(100)

df <-tibble(
  x = rep(seq(1,15, by=1), 100),
  true = rep((seq(1,100, by=1) * 100 * runif(1)), each = 15), 
  measurment = lapply(
    rep(1,100), 
    function(data){
      data*rnorm(15, mean = 100*runif(1), sd = 2)
      }) %>%
    unlist(),
  z = rep(c("a","b","c","d","e"), 20, each = 15),
  ID = rep(paste0("a", 1:20), each = 15*5))