Search code examples
rfunction

writing group_by friendly functions


I'm building a function to add a validation coulomb to my data that I plan to model later. (tidymodel::initial.split() won't let me stratify deeply enough hence my diy approach)

the function is simple it will add a column (when used throughout mutate) that reads either "validation" or "testing" and will randomly apply these at a 0.25/075 ratio.

I have made the function fully work but it isn't perfect. the only way i can get it to see the group_by is to place a n() in my execution code

#edit
set.seed(100)

#exsample data
df<-tibble(x = seq(1,15,by=1)%>%rep(100),
           true = (seq(1,100,by=1)*100*runif(1))%>%rep(each = 15),  
           measurment = lapply(rep(1,100),
                               function(data) 
                               {data*rnorm(15, mean = 100*runif(1), sd = 2)}
           )%>%unlist(),
           z = c("a","b","c","d","e")%>%rep(20, each = 15),
           ID = paste0("a", 1:20)%>%rep(each = 15*5)
)



validation_column <- function(data) {
  # Generate a vector of labels based on the desired ratio
  labels <- c(rep("training", round(0.75 * data)), rep("validation", round(0.25 * data)))
  # Shuffle the labels
  labels <- sample(labels)
  return(labels)
}

df %>%
  group_by(ID,z) %>%
  mutate(validation = validation_column(n()))

I believe I need to incorporate some kind of if{} into my code to trigger a do() when I build the function but i cant figure out how exactly. this was my attempt based on https://www.r-bloggers.com/2018/07/writing-pipe-friendly-functions/ but it still doesn't work

any advice would be welcome so i can build beter functions in the future.

validation_column <- function(data) {
  if (dplyr::is_grouped_df(data)) {
    return(dplyr::do(data, ~ validation_column(.)))
  }
  
  # Generate a vector of labels based on the desired ratio
  labels <- c(rep("training", round(0.75 * n(data))), rep("validation", round(0.25 * n(data))))
  # Shuffle the labels
  labels <- sample(labels)
  return(labels)
}


df %>%
  group_by(ID,z) %>%
  mutate(validation = validation_column())

Solution

  • Your rep(..round(data)) needs a numeric input, a length.

    Try this:

    validation_column <- function(data) {
      
      # Generate a vector of labels based on the desired ratio
      labels <- c(
        rep("training",   round(0.75 * length(data))), # length(data) 
        rep("validation", round(0.25 * length(data)))) # length(data)
      
      # Shuffle the labels
      labels <- sample(labels)
      
      return(labels)
    }
    

    Pass a column as mutate input:

    > mutate(df, validation = validation_column(z))
    # A tibble: 1,500 × 6
           x  true measurment z     ID    validation
       <dbl> <dbl>      <dbl> <chr> <chr> <chr>     
     1     1  30.8       26.0 a     a1    training  
     2     2  30.8       25.6 a     a1    training  
     3     3  30.8       27.5 a     a1    training  
     4     4  30.8       26.0 a     a1    training  
     5     5  30.8       26.4 a     a1    validation
     6     6  30.8       24.6 a     a1    validation
     7     7  30.8       27.2 a     a1    validation
     8     8  30.8       24.1 a     a1    training  
     9     9  30.8       25.0 a     a1    training  
    10    10  30.8       25.9 a     a1    training  
    # ℹ 1,490 more rows
    # ℹ Use `print(n = ...)` to see more rows
    

    Edit. Sample data with set.seed:

    set.seed(100)
    
    df <-tibble(
      x = rep(seq(1,15, by=1), 100),
      true = rep((seq(1,100, by=1) * 100 * runif(1)), each = 15), 
      measurment = lapply(
        rep(1,100), 
        function(data){
          data*rnorm(15, mean = 100*runif(1), sd = 2)
          }) %>%
        unlist(),
      z = rep(c("a","b","c","d","e"), 20, each = 15),
      ID = rep(paste0("a", 1:20), each = 15*5))