
How to sapply a vector on a user defined function in R


I have a user-defined function called make_data for creating a dataset. I need to generate 3 different datasets using make_data and mu_1 <- seq(1:3). I don't know how to use sapply here, since the make_data function has multiple arguments.

library(dplyr) # for `%>%` and `slice`
library(caret) # for createDataPartition
make_data <- function(n = 1000, p = 0.5,
                      mu_0 = 0, mu_1 = 2,
                      sigma_0 = 1, sigma_1 = 1) {

  y <- rbinom(n, 1, p)
  f_0 <- rnorm(n, mu_0, sigma_0)
  f_1 <- rnorm(n, mu_1, sigma_1)
  x <- ifelse(y == 1, f_1, f_0)

  test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)

  list(train = data.frame(x = x, y = as.factor(y)) %>% slice(-test_index),
       test = data.frame(x = x, y = as.factor(y)) %>% slice(test_index))
}

Using the sapply function:

mu_1 <- seq(0, 3)
dat_3 <- sapply(mu_1, make_data)

I am getting an error report as shown below.

Error in createDataPartition(y, times = 1, p = 0.5, list = FALSE) : y must have at least 2 data points.


Solution

  • The error arises because sapply passes each element of mu_1 to make_data positionally, so it is matched to the first parameter, n, rather than to the mu_1 parameter of make_data. The first element is 0, so make_data is called with n = 0, y is empty, and createDataPartition complains. To pass a value to a "non-first" parameter of a function whose other parameters all have acceptable defaults, wrap the call in an anonymous function and pass the value as a named argument:

     library(dplyr) # for `%>%` and `slice`
     library(caret) # for createDataPartition
     # your code here
     dat_3 <- sapply(mu_1, function(param) make_data(mu_1 = param))  # succeeds
    

    With this change, n keeps its default of 1000, as you clearly intended, so each train and test set has 500 rows:

    str(dat_3)
    List of 8
     $ :'data.frame':   500 obs. of  2 variables:
      ..$ x: num [1:500] 2.963 0.313 0.853 -1.154 -1.895 ...
      ..$ y: Factor w/ 2 levels "0","1": 1 1 2 2 1 2 2 1 2 2 ...
     $ :'data.frame':   500 obs. of  2 variables:
      ..$ x: num [1:500] -1.288 1.245 -0.109 -0.794 0.11 ...
      ..$ y: Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 2 1 ...
     $ :'data.frame':   500 obs. of  2 variables:
      ..$ x: num [1:500] -0.686 1.823 -0.052 1.189 -0.318 ...
      ..$ y: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 1 1 ...
     $ :'data.frame':   500 obs. of  2 variables:
      ..$ x: num [1:500] -0.623 0.311 1.298 0.848 1.17 ...
      ..$ y: Factor w/ 2 levels "0","1": 2 1 2 1 1 2 1 2 2 1 ...
     $ :'data.frame':   500 obs. of  2 variables:
      ..$ x: num [1:500] 0.956 0.825 1.592 2.729 -0.299 ...
      ..$ y: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 1 1 ...
     $ :'data.frame':   500 obs. of  2 variables:
      ..$ x: num [1:500] 1.92059 3.29866 0.00569 0.38111 0.41855 ...
      ..$ y: Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...
     $ :'data.frame':   500 obs. of  2 variables:
      ..$ x: num [1:500] 4.572 3.19 -0.598 3.744 0.463 ...
      ..$ y: Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 1 2 2 ...
     $ :'data.frame':   500 obs. of  2 variables:
      ..$ x: num [1:500] 2.7439 -0.0985 -0.4698 -1.2808 0.6663 ...
      ..$ y: Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
     - attr(*, "dim")= int [1:2] 2 4
     - attr(*, "dimnames")=List of 2
      ..$ : chr [1:2] "train" "test"
      ..$ : NULL
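
The positional matching described at the start of this answer can be seen in isolation with a much smaller function. The sketch below is only an illustration: show_args and mu_vals are hypothetical names, not part of the original code, and show_args simply reports which parameter received each value.

```r
# Hypothetical toy function: it mirrors only the n and mu_1 parameters
# of make_data and returns the values it actually received.
show_args <- function(n = 1000, mu_1 = 2) {
  c(n = n, mu_1 = mu_1)
}

mu_vals <- 0:3

# Each element is passed as the first argument, so it lands in n.
by_position <- sapply(mu_vals, show_args)

# Wrapping the call lets each element be passed by name to mu_1.
by_name <- sapply(mu_vals, function(m) show_args(mu_1 = m))

by_position["n", ]   # the values of mu_vals ended up in n
by_name["n", ]       # n keeps its default
by_name["mu_1", ]    # the values of mu_vals ended up in mu_1
```

In the first call the values 0 to 3 end up in n, which is the situation that produced the original error. In the second, n stays at its default and the values reach mu_1.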
    

    The anonymous function got rid of the error, but the result is not structured the way you intended. sapply "simplified" it (that is the s in sapply) into a 2 x 4 matrix of data frames, so the train and test names survive only as row dimnames. You should use lapply instead. It returns a plain list with one element per value of mu_1, each holding named train and test data frames, which you can properly iterate over:

      dat_3 <- lapply(mu_1, function(x) make_data(mu_1 = x))
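
To compare the two result shapes without loading caret, here is a small sketch. It assumes a hypothetical stand-in, toy_split, that returns a named list of two data frames the way make_data does. The paste0 labels at the end are also just one possible naming choice, not something from the original code.

```r
# Hypothetical stand-in for make_data: returns a named list of two data frames.
toy_split <- function(mu_1 = 2) {
  list(train = data.frame(x = mu_1 + 1:3),
       test  = data.frame(x = mu_1 - 1:3))
}

mu_vals <- 0:3

simplified <- sapply(mu_vals, function(m) toy_split(mu_1 = m))
as_list    <- lapply(mu_vals, function(m) toy_split(mu_1 = m))

dim(simplified)       # a 2 x 4 list-matrix: rows train/test, one column per value
length(as_list)       # one list element per value
names(as_list[[1]])   # each element keeps its train and test names

# Optional: label each element by the mu_1 value that produced it.
names(as_list) <- paste0("mu_", mu_vals)
as_list$mu_2$train
```

With the list form, a loop such as for (d in as_list) can work with d$train and d$test directly.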
    

    I started out thinking I would answer the question by deploying traceback(), showing how to debug, and basically expanding on the comments, but that got me nowhere. I then realized that the way sapply/lapply act on named objects was at the root of the problem. It's a stumbling block that has frustrated many new and old users of R. Only the values, not the names, are passed to the function. The responsibility for properly matching arguments to any parameter but the first is left entirely to the user. Not even the names of values destined for the first argument make it through. When you "say" lapply(obj_name, FUN), it turns out that FUN does NOT get obj_name but only the elements of the result of eval(obj_name), one at a time.
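
The point about names can be checked with a small sketch. The vector vals and its labels are made up for illustration, and Map() is shown as one possible workaround rather than the only one.

```r
vals <- c(low = 1, high = 3)

# FUN receives each element on its own, so the element's name is not
# visible inside FUN: names(v) is NULL for every element.
seen_names <- lapply(vals, function(v) names(v))

# One workaround: pass the values and their names side by side with Map().
labelled <- Map(function(value, name) paste(name, value), vals, names(vals))

seen_names
labelled
```

The result of lapply still carries the names low and high on its elements, because they are reattached to the output. They are simply never handed to FUN.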