I'm building a function to add a validation coulomb to my data that I plan to model later. (tidymodel::initial.split() won't let me stratify deeply enough hence my diy approach)
the function is simple it will add a column (when used throughout mutate) that reads either "validation" or "testing" and will randomly apply these at a 0.25/075 ratio.
I have made the function fully work but it isn't perfect. the only way i can get it to see the group_by is to place a n() in my execution code
#edit
set.seed(100)
#exsample data
df<-tibble(x = seq(1,15,by=1)%>%rep(100),
true = (seq(1,100,by=1)*100*runif(1))%>%rep(each = 15),
measurment = lapply(rep(1,100),
function(data)
{data*rnorm(15, mean = 100*runif(1), sd = 2)}
)%>%unlist(),
z = c("a","b","c","d","e")%>%rep(20, each = 15),
ID = paste0("a", 1:20)%>%rep(each = 15*5)
)
validation_column <- function(data) {
# Generate a vector of labels based on the desired ratio
labels <- c(rep("training", round(0.75 * data)), rep("validation", round(0.25 * data)))
# Shuffle the labels
labels <- sample(labels)
return(labels)
}
df %>%
group_by(ID,z) %>%
mutate(validation = validation_column(n()))
I believe I need to incorporate some kind of if{} into my code to trigger a do() when I build the function but i cant figure out how exactly. this was my attempt based on https://www.r-bloggers.com/2018/07/writing-pipe-friendly-functions/ but it still doesn't work
any advice would be welcome so i can build beter functions in the future.
validation_column <- function(data) {
if (dplyr::is_grouped_df(data)) {
return(dplyr::do(data, ~ validation_column(.)))
}
# Generate a vector of labels based on the desired ratio
labels <- c(rep("training", round(0.75 * n(data))), rep("validation", round(0.25 * n(data))))
# Shuffle the labels
labels <- sample(labels)
return(labels)
}
df %>%
group_by(ID,z) %>%
mutate(validation = validation_column())
Your rep(..round(data))
needs a numeric input, a length.
Try this:
validation_column <- function(data) {
# Generate a vector of labels based on the desired ratio
labels <- c(
rep("training", round(0.75 * length(data))), # length(data)
rep("validation", round(0.25 * length(data)))) # length(data)
# Shuffle the labels
labels <- sample(labels)
return(labels)
}
Pass a column as mutate
input:
> mutate(df, validation = validation_column(z))
# A tibble: 1,500 × 6
x true measurment z ID validation
<dbl> <dbl> <dbl> <chr> <chr> <chr>
1 1 30.8 26.0 a a1 training
2 2 30.8 25.6 a a1 training
3 3 30.8 27.5 a a1 training
4 4 30.8 26.0 a a1 training
5 5 30.8 26.4 a a1 validation
6 6 30.8 24.6 a a1 validation
7 7 30.8 27.2 a a1 validation
8 8 30.8 24.1 a a1 training
9 9 30.8 25.0 a a1 training
10 10 30.8 25.9 a a1 training
# ℹ 1,490 more rows
# ℹ Use `print(n = ...)` to see more rows
Edit. Sample data with set.seed
:
set.seed(100)
df <-tibble(
x = rep(seq(1,15, by=1), 100),
true = rep((seq(1,100, by=1) * 100 * runif(1)), each = 15),
measurment = lapply(
rep(1,100),
function(data){
data*rnorm(15, mean = 100*runif(1), sd = 2)
}) %>%
unlist(),
z = rep(c("a","b","c","d","e"), 20, each = 15),
ID = rep(paste0("a", 1:20), each = 15*5))