R - How to add columns to a dataset incrementally using a loop?


I'm trying to get the error rates for a Naive Bayes classifier as I add variables one at a time. For example, my dataset has 25 variables. I want the error rate of the model built on the first 2 columns, then on the first 3 columns, then on the first 4 columns, and so on up to the last column.

Here is the pseudocode for what I'm trying to achieve:

START
IMPORT DATASET WITH ALL VARIABLES

num_variables = num_dataset_cols
i = 1

WHILE (i <= num_variables)
{
   CREATE NEW DATASET WITH i COLUMNS

   BUILD THE MODEL 
   GET THE ERROR RATE

   ADD IN NEXT COLUMN

   i = i + 1
}

Here is a reproducible example. Obviously you can't build an NB classifier with this data, but that's not my problem. My problem is adding in the columns one by one. So far, every attempt replaces the previously added column instead of keeping it. For an NB classifier, the first column is the class node, so the model needs at least 2 columns to start with.

#REPRODUCIBLE EXAMPLE
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")

dataset <- data.frame(col1, col2, col3, col4,col5)

num_variables <- ncol(dataset)

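i <- 1

# stop at num_variables - 1 so that column i + 1 always exists
while (i < num_variables) {
  # PROBLEM: this keeps only column 1 and column i + 1
  data <- dataset[c(1, i + 1)]
  str(data)

  #BUILD MODEL AND GET VALIDATION ERROR

  #INCREMENT i TO GET NEXT COLUMN
  i <- i + 1
}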

You can see from the str(data) output that each iteration keeps only the first column plus one other column, so the previously added column is overwritten. Does anyone know how I could add each column without overwriting the previous one? Someone suggested an array to me, but I'm not too familiar with arrays in R. Would this work?


Solution

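  • I think this is what you want. Select the columns with 1:i instead of c(1, i + 1), so each pass keeps all the earlier columns and adds the next one.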

    col1 <- c("A", "B", "C", "D", "E")
    col2 <- c(1, 2, 3, 4, 5)
    col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
    col4 <- c("n", "y", "y", "n", "y")
    col5 <- c("10", "15", "50", "100", "20")

    # stringsAsFactors = TRUE reproduces the Factor columns shown below on R >= 4.0
    dataset <- data.frame(col1, col2, col3, col4, col5, stringsAsFactors = TRUE)

    num_variables <- ncol(dataset)

    # start at 2: the class column plus at least one predictor
    i <- 2

    while (i <= num_variables) {
      # keep the first i columns; drop = FALSE guarantees a data frame
      data <- dataset[, 1:i, drop = FALSE]

      # str() prints directly and returns NULL, so do not wrap it in print()
      str(data)

      #BUILD MODEL AND GET VALIDATION ERROR

      #INCREMENT i TO GET NEXT COLUMN
      i <- i + 1
    }

    # Output:
    'data.frame':   5 obs. of  2 variables:
     $ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
     $ col2: num  1 2 3 4 5
    'data.frame':   5 obs. of  3 variables:
     $ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
     $ col2: num  1 2 3 4 5
     $ col3: logi  TRUE FALSE FALSE TRUE FALSE
    'data.frame':   5 obs. of  4 variables:
     $ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
     $ col2: num  1 2 3 4 5
     $ col3: logi  TRUE FALSE FALSE TRUE FALSE
     $ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
    'data.frame':   5 obs. of  5 variables:
     $ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
     $ col2: num  1 2 3 4 5
     $ col3: logi  TRUE FALSE FALSE TRUE FALSE
     $ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
     $ col5: Factor w/ 5 levels "10","100","15",..: 1 3 5 2 4
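
To go from a loop like this to actual error rates, one option is to collect one value per column count in a vector. The sketch below is an illustration under assumptions, not the asker's setup: it uses the built-in iris data with the class column moved to the front, a 70/30 train/test split, and naiveBayes() from the e1071 package, which has to be installed separately.

```r
library(e1071)  # assumption: e1071 is installed; it provides naiveBayes()

# iris ships with R; move the class column first to match the layout in the question
dataset <- iris[, c("Species", setdiff(names(iris), "Species"))]

# illustrative 70/30 split (the seed and the ratio are arbitrary choices)
set.seed(1)
train_idx <- sample(nrow(dataset), size = round(0.7 * nrow(dataset)))
train <- dataset[train_idx, ]
test  <- dataset[-train_idx, ]

# one model per column count: columns 1..2, then 1..3, and so on
col_counts <- 2:ncol(dataset)

error_rates <- vapply(col_counts, function(i) {
  # predictors are columns 2..i; column 1 is the class
  model <- naiveBayes(x = train[, 2:i, drop = FALSE], y = train[[1]])
  pred  <- predict(model, test[, 2:i, drop = FALSE])
  # share of misclassified test rows
  mean(as.character(pred) != as.character(test[[1]]))
}, numeric(1))

names(error_rates) <- paste0("first_", col_counts, "_cols")
print(error_rates)
```

error_rates then holds one named value per step. That also covers the array suggestion in the question: a plain numeric vector is enough here, and a list would do if you wanted to keep the fitted models as well.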