I have a user-defined function called make_data for creating a dataset. I need to generate 3 different datasets using make_data and mu_1 <- seq(1:3). I don't know how to use sapply, since the make_data function has multiple arguments.
library(dplyr) # for `%>%` and `slice`
library(caret) # for createDataPartition
make_data <- function(n = 1000, p = 0.5,
                      mu_0 = 0, mu_1 = 2,
                      sigma_0 = 1, sigma_1 = 1) {
  y <- rbinom(n, 1, p)
  f_0 <- rnorm(n, mu_0, sigma_0)
  f_1 <- rnorm(n, mu_1, sigma_1)
  x <- ifelse(y == 1, f_1, f_0)
  test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
  list(train = data.frame(x = x, y = as.factor(y)) %>% slice(-test_index),
       test = data.frame(x = x, y = as.factor(y)) %>% slice(test_index))
}
mu_1 <- seq(0, 3)
dat_3 <- sapply(mu_1, make_data)
I am getting the error shown below.
Error in createDataPartition(y, times = 1, p = 0.5, list = FALSE) : y must have at least 2 data points.
Your error arises because each element of your mu_1 vector was being position-matched, not to the mu_1 parameter of your make_data function, but to its first parameter, n. With mu_1 <- seq(0, 3), the first call is make_data(0), so n is 0, y is empty, and createDataPartition stops. To pass values to a "non-first" parameter of a function whose other parameters all have acceptable defaults, wrap the call in an anonymous function and pass the value as a named argument:
library(dplyr) # for `%>%` and `slice`
library(caret) # for createDataPartition
# your code here
dat_3 <- sapply(mu_1, function(param) make_data(mu_1 = param)) # succeeds
The n parameter is now the 1000 that you clearly intended it to be.
str(dat_3)
List of 8
$ :'data.frame': 500 obs. of 2 variables:
..$ x: num [1:500] 2.963 0.313 0.853 -1.154 -1.895 ...
..$ y: Factor w/ 2 levels "0","1": 1 1 2 2 1 2 2 1 2 2 ...
$ :'data.frame': 500 obs. of 2 variables:
..$ x: num [1:500] -1.288 1.245 -0.109 -0.794 0.11 ...
..$ y: Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 2 1 ...
$ :'data.frame': 500 obs. of 2 variables:
..$ x: num [1:500] -0.686 1.823 -0.052 1.189 -0.318 ...
..$ y: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 1 1 ...
$ :'data.frame': 500 obs. of 2 variables:
..$ x: num [1:500] -0.623 0.311 1.298 0.848 1.17 ...
..$ y: Factor w/ 2 levels "0","1": 2 1 2 1 1 2 1 2 2 1 ...
$ :'data.frame': 500 obs. of 2 variables:
..$ x: num [1:500] 0.956 0.825 1.592 2.729 -0.299 ...
..$ y: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 1 1 ...
$ :'data.frame': 500 obs. of 2 variables:
..$ x: num [1:500] 1.92059 3.29866 0.00569 0.38111 0.41855 ...
..$ y: Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...
$ :'data.frame': 500 obs. of 2 variables:
..$ x: num [1:500] 4.572 3.19 -0.598 3.744 0.463 ...
..$ y: Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 1 2 2 ...
$ :'data.frame': 500 obs. of 2 variables:
..$ x: num [1:500] 2.7439 -0.0985 -0.4698 -1.2808 0.6663 ...
..$ y: Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
- attr(*, "dim")= int [1:2] 2 4
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "train" "test"
..$ : NULL
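To see the position matching in isolation, here is a minimal base-R sketch. The function toy and its parameters are hypothetical stand-ins for make_data, chosen so the example runs without caret or dplyr.

```r
# Hypothetical toy function: first parameter is `n`, second is `mu`
toy <- function(n = 10, mu = 0) c(n = n, mu = mu)

# Each element of 1:3 is matched by position to the first parameter, `n`
by_position <- sapply(1:3, toy)

# Wrapping the call in an anonymous function routes each element to `mu` by name
by_name <- sapply(1:3, function(m) toy(mu = m))

by_position["n", ]   # the values landed in n; mu kept its default
by_name["mu", ]      # the values landed in mu; n kept its default
```

The anonymous-function wrapper is the same device used in the make_data call above.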
That got rid of the error, but the datasets are not arranged the way you intended. The "simplification" step of sapply (which is the s in sapply) collapsed the results into a 2 x 4 list-matrix, so "train" and "test" survive only as row dimnames. You should instead use lapply. It returns a list with one element per value of mu_1, each holding named train and test data frames, so you can properly iterate over it, rather than over the "simplified" result from sapply:
dat_3 <- lapply(mu_1, function(x) make_data(mu_1 = x))
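The difference in result shape can be checked with a small base-R sketch. Here toy_split is a hypothetical stand-in for make_data that returns a named list of two data frames without needing caret or dplyr.

```r
# Hypothetical stand-in for make_data(): a named list of two data frames
toy_split <- function(mu = 0) {
  list(train = data.frame(x = mu + 1:3),
       test  = data.frame(x = mu + 4:6))
}

mus <- 0:3
simplified <- sapply(mus, function(m) toy_split(mu = m))
as_list    <- lapply(mus, function(m) toy_split(mu = m))

dim(simplified)        # sapply simplified the results into a matrix of lists
names(as_list[[1]])    # lapply keeps each result as a named list
as_list[[2]]$train     # training frame for the second value of mus
```

With the lapply result, each dataset is reached with ordinary list indexing such as as_list[[2]]$train, which is usually easier to loop over than matrix-style indexing into the simplified object.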
I started out thinking I would answer the question by deploying traceback() and showing how to debug, basically expanding on the comments, but that got me nowhere. I realized that the way sapply/lapply handle named objects was at the root of the problem. It's a stumbling block that has frustrated many new and old users of R. Only the values, not the names, are passed to the function. The responsibility for properly routing values to any argument but the first is left entirely to the user. And not even the names of values destined for the first argument make it through. When you "say" lapply(obj_name, FUN), it turns out that FUN does NOT get obj_name, but only the result of eval(obj_name), one element at a time.
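A short base-R sketch of that last point, using a hypothetical named vector vals: inside FUN the element's name is not visible, and one possible workaround (an assumption about what you might want, not the only option) is to iterate over names and values together with Map().

```r
vals <- c(a = 1, b = 2)

# FUN receives each bare value, so names(x) is NULL inside the call
seen_names <- lapply(vals, function(x) names(x))

# One workaround: pass the names and the values as parallel arguments
labelled <- Map(function(nm, v) paste(nm, v), names(vals), vals)

seen_names
labelled
```

Map() also accepts named arguments, so a call of the form Map(FUN, mu_1 = some_vector) passes each element to FUN's mu_1 parameter by name rather than by position.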