r function dataframe statistical-sampling

Multiple sampling inside an R function


I am trying to make a function that in the end will run multiple machine learning algorithms on my data set. I have the first little bit of my function below and a small sample of data.

The problem I am running into is with splitting my data into four different data frames and then passing them to the given functions. In the first function I am testing, the data runs through the logistic regression model, but the output shows that the model uses all the data and not just the 1/4 of the data frame df that I intend. I checked with <<- to see what data is being passed through, and it is a data set that is 1/4 of the data frame df, as expected. Question: why does the right subset reach my global environment but not my regression function, and how would I correct this?

Data:

zeroFac <- c(1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1)

goal <- c(8.412055,  7.528869,  8.699681, 10.478752,  9.210440, 10.308986, 10.126671, 11.002117, 10.308986,  7.090910, 10.819798,  7.824446,  8.612685,
7.601402, 10.126671,  7.313887,  5.993961,  7.313887,  8.517393, 12.611541)

City_Pop <- c( 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613,
11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613)

df <- data.frame(zeroFac,goal,City_Pop)

Function:

forestModel <- function(eq1, ...){

  #making our original data frame
  train <- data.frame(cbind(...))

  ################

    #splitting into 4 data sets
    set.seed(123)

    ss <- sample(1:4, size = nrow(train), replace=TRUE, prob = c(0.25,0.25,0.25,0.25))

    t1 <- train[ss==1,]
    t2 <- train[ss==2,]
    t3 <- train[ss==3,]
    t4 <- train[ss==4,]

  ################

  m <- glm(eq1, family = binomial(link = 'logit'), data = t1)
  summary(m)

}

eq1 <- df$zeroFac ~ df$goal + df$City_Pop


forestModel(eq1, df$zeroFac, df$goal, df$City_Pop)

Solution

  • The formula must refer to bare column names, and the columns of train must carry those names inside the function. Change the equation from eq1 <- df$zeroFac ~ df$goal + df$City_Pop to eq1 <- zeroFac ~ goal + City_Pop. With the df$ prefix, each term is looked up in the global data frame df rather than in the data argument, so glm fits on all rows no matter which subset you pass. After binding the inputs together into train, you also have to name its columns so that glm knows which columns the equation refers to.

    forestModel <- function(eq1, ...){
    
      #making our original data frame
    
      train <- data.frame(cbind(...))
      colNames <- colnames(data.frame(...))
      coln <- do.call(cbind, lapply(X = strsplit(colNames, "\\."), FUN = function(X) X[[2]]))
      colnames(train) <- coln
    
      ################
    
      #splitting into 4 data sets
      set.seed(123)
    
      ss <- sample(1:4, size = nrow(train), replace=TRUE, prob = c(0.25,0.25,0.25,0.25))
    
      t1 <- train[ss==1,]
      ################
    
      m <- glm(eq1, family = binomial(link = 'logit'), data = t1)
      summary(m)
    }
    
    eq1 <- zeroFac ~ goal + City_Pop
    forestModel(eq1, df$zeroFac, df$goal, df$City_Pop)
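
The behaviour described above can be checked with nobs(), and the column renaming can be avoided by passing the data frame itself. The following is only a sketch under stated assumptions: the toy data is made up for illustration (it is not the data from the question), and fit_quarters is a hypothetical helper name, not part of the original answer.

```r
# Toy data, made up for illustration (80 rows, two columns)
set.seed(1)
df <- data.frame(
  zeroFac = rbinom(80, 1, 0.6),
  goal    = rnorm(80, mean = 9, sd = 1.5)
)
sub <- df[1:20, ]

# Terms written as df$... are looked up in the global df, so the
# subset passed through `data` has no effect on the fit
m_bad  <- glm(df$zeroFac ~ df$goal, family = binomial(link = "logit"), data = sub)
# Bare column names are resolved inside `data`, so only sub is used
m_good <- glm(zeroFac ~ goal, family = binomial(link = "logit"), data = sub)

nobs(m_bad)   # number of rows used: all of df
nobs(m_good)  # number of rows used: only sub

# Hypothetical helper: take the data frame directly instead of
# separate vectors, so the column names survive without renaming
fit_quarters <- function(formula, data, seed = 123) {
  set.seed(seed)
  # Assign each row at random to one of four groups
  ss <- sample(1:4, size = nrow(data), replace = TRUE)
  parts <- split(data, factor(ss, levels = 1:4))
  # Fit one logistic regression per group
  lapply(parts, function(d) {
    glm(formula, family = binomial(link = "logit"), data = d)
  })
}

models <- fit_quarters(zeroFac ~ goal, df)
sapply(models, nobs)  # rows used by each of the four fits
```

Passing df directly keeps the names zeroFac and goal intact, so the strsplit step is not needed. With very small groups glm may warn about fitted probabilities of 0 or 1, which is a property of the data rather than of the splitting.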