r - levels added to dataframe, why?


I am posting this to better understand how "levels" work in R. The other answers I found were not fully explanatory (see for example this).

Consider the following short script, where I calculate the RMSE of each column of a random dataframe df and store the value, together with the column index, as a row of a new dataframe bestcombo:

# 10 rows x 5 columns, so only 10*5 random values are needed
df = as.data.frame(matrix(rbinom(10*5, 1, .5), nrow = 10, ncol = 5))

#generate empty dataframe and assign col names
bestcombo = data.frame(matrix(ncol = 2, nrow = 0))
colnames(bestcombo) = c("RMSE", "Row Number")

#for each col of df calculate RMSE and store together with col name
for (i in 1:5){
  RMSE = sqrt(mean(df[,i] ^ 2))
  row_num = i

  row = as.data.frame(cbind( RMSE, toString(row_num) ))
  colnames(row) = c("RMSE", "Row Number")
  bestcombo = rbind(bestcombo, row)
}

The problem is that "Levels" are generated. Why?

bestcombo$RMSE
             RMSE              RMSE              RMSE              RMSE              RMSE 
0.547722557505166 0.774596669241483 0.707106781186548 0.836660026534076 0.707106781186548 
Levels: 0.547722557505166 0.774596669241483 0.707106781186548 0.836660026534076

bestcombo$RMSE[1]
             RMSE 
0.547722557505166 
Levels: 0.547722557505166 0.774596669241483 0.707106781186548 0.836660026534076

Why is this happening, and how can I avoid it? Is it due to incorrect use of rbind()?

This also causes other problems. For example, order() does not sort the rows by their numeric RMSE value:

bestcombo[order(bestcombo$RMSE),]

               RMSE Row Number
1 0.547722557505166          1
2 0.774596669241483          2
3 0.707106781186548          3
5 0.707106781186548          5
4 0.836660026534076          4

Solution
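
The levels come from the way each row is built: cbind() on a number and a string returns a character matrix, and as.data.frame() then turns character columns into factors (the default before R 4.0.0). The sketch below reproduces that with illustrative values that are not taken from the question, and passes stringsAsFactors = TRUE explicitly so that it behaves the same on newer R versions. It also shows one way to get numbers back out of a factor.

```r
# cbind() on a number and a string coerces everything to character
m <- cbind(RMSE = 0.5, id = "1")
typeof(m)                       # "character"

# Pre-4.0.0 default, made explicit here: character columns become factors
row <- as.data.frame(m, stringsAsFactors = TRUE)
is.factor(row$RMSE)             # TRUE

# Recover numbers from a factor via its labels, not its integer codes
rmse <- factor(c("0.55", "0.77", "0.71"))
as.numeric(as.character(rmse))  # 0.55 0.77 0.71
as.numeric(rmse)                # 1 3 2 (level codes, not the values)
```

The last line is also why order() appears to misbehave on a factor column: it sorts by the level codes rather than by the numeric values.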

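  • Build each row with data.frame() rather than as.data.frame(cbind(...)), so that RMSE stays numeric instead of being coerced to a string. You want something more like this: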

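    # start from the empty bestcombo defined in the question
    # for each column of df, calculate the RMSE and store it together with the column index
    for (i in 1:5){
        RMSE = sqrt(mean(df[,i] ^ 2))
        row_num = i

        # check.names = FALSE keeps the space in "Row Number";
        # stringsAsFactors = FALSE (the default since R 4.0.0) keeps the label a string
        row = data.frame(RMSE = RMSE, `Row Number` = as.character(row_num),
                         check.names = FALSE, stringsAsFactors = FALSE)
        bestcombo = rbind(bestcombo, row)
    }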
    

    Alternatively, if you really want to set the column names on a second line, you can do this:

    for (i in 1:5){
        RMSE = sqrt(mean(df[,i] ^ 2))
        row_num = i
    
        row = data.frame(RMSE, as.character(row_num), stringsAsFactors = FALSE)
        colnames(row) = c("RMSE", "Row Number")
        bestcombo = rbind(bestcombo, row)
    }
    

    Just for the sake of completeness, I'll add that, while it wasn't the focus of your question, growing a data frame by rbinding rows one at a time like this incurs a significant speed penalty once the data frame gets reasonably large, because every rbind() call copies the whole data frame.
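
As a rough sketch of how to avoid growing the data frame at all, the RMSE of every column can be computed in one pass and the result built with a single data.frame() call. This assumes the same df layout as in the question; the seed is arbitrary and only there to make the sketch reproducible.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
df <- as.data.frame(matrix(rbinom(10 * 5, 1, 0.5), nrow = 10, ncol = 5))

# One RMSE per column, computed without an explicit loop
rmse <- sapply(df, function(col) sqrt(mean(col^2)))

# Build the whole result in one call instead of rbind-ing row by row
bestcombo <- data.frame(
  RMSE = unname(rmse),
  `Row Number` = as.character(seq_along(rmse)),
  check.names = FALSE,       # keep the space in "Row Number"
  stringsAsFactors = FALSE   # keep the labels as plain strings
)

# RMSE is numeric here, so order() sorts by value
bestcombo[order(bestcombo$RMSE), ]
```

Because RMSE never passes through a character matrix, no levels are created and no conversion back to numeric is needed.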