I am posting this to better understand how "levels" work in R. The other answers I found were not fully explanatory (see for example this).
Consider the following short script, where I calculate the RMSE of each column of a random data frame df
and store each value as a row of a new data frame bestcombo:
df = as.data.frame(matrix(rbinom(10*5, 1, .5), nrow = 10, ncol = 5))
#generate empty dataframe and assign col names
bestcombo = data.frame(matrix(ncol = 2, nrow = 0))
colnames(bestcombo) = c("RMSE", "Row Number")
#for each col of df calculate RMSE and store together with col name
for (i in 1:5){
  RMSE = sqrt(mean(df[,i] ^ 2))
  row_num = i
  row = as.data.frame(cbind( RMSE, toString(row_num) ))
  colnames(row) = c("RMSE", "Row Number")
  bestcombo = rbind(bestcombo, row)
}
The problem is that "Levels" are generated. Why?
bestcombo$RMSE
RMSE RMSE RMSE RMSE RMSE
0.547722557505166 0.774596669241483 0.707106781186548 0.836660026534076 0.707106781186548
Levels: 0.547722557505166 0.774596669241483 0.707106781186548 0.836660026534076
bestcombo$RMSE[1]
RMSE
0.547722557505166
Levels: 0.547722557505166 0.774596669241483 0.707106781186548 0.836660026534076
Why is this happening, and how can I avoid it? Is it due to incorrect use of rbind()?
It also causes other problems. For example, order() does not sort the rows by their RMSE value:
bestcombo[order(bestcombo$RMSE),]
               RMSE Random Vector
1 0.547722557505166             1
2 0.774596669241483             2
3 0.707106781186548             3
5 0.707106781186548             5
4 0.836660026534076             4
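The levels come from cbind(), not from rbind(). cbind() builds a matrix, and a matrix can hold only one type, so combining RMSE with the string from toString() turns both values into character. as.data.frame() then converts those strings to factors (the default before R 4.0.0; from 4.0.0 on they stay character, which is still not numeric). order() then sorts by the factor's level codes rather than by the numbers. Build each row with data.frame() instead, so RMSE stays numeric. You want something more like this: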
#for each col of df calculate RMSE and store together with col name
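for (i in 1:5){
  RMSE = sqrt(mean(df[,i] ^ 2))
  row_num = i
  # check.names = FALSE keeps the space in "Row Number" (otherwise it becomes Row.Number);
  # stringsAsFactors = FALSE keeps the label as character on R versions before 4.0.0
  row = data.frame(RMSE = RMSE, `Row Number` = as.character(row_num),
                   check.names = FALSE, stringsAsFactors = FALSE)
  bestcombo = rbind(bestcombo, row)
}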
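To see where the factor comes from, the coercion can be reproduced on a single row. This is a minimal sketch with a made-up RMSE value; stringsAsFactors = TRUE is passed explicitly only to mimic the pre-4.0.0 default, so the result is the same on any R version.

```r
RMSE <- 0.5477226   # illustrative value, not taken from a real run
row_num <- 1

# cbind() builds a matrix, and a matrix holds a single type,
# so the number is coerced to a string next to toString(row_num).
m <- cbind(RMSE, toString(row_num))
is.character(m)

# Mimic the old default: character columns become factors.
old <- as.data.frame(m, stringsAsFactors = TRUE)
class(old$RMSE)

# data.frame() keeps each column's own type, so RMSE stays numeric.
new <- data.frame(RMSE = RMSE, stringsAsFactors = FALSE)
class(new$RMSE)
```

The three checks show a character matrix, a factor column, and a numeric column, in that order.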
Alternatively, if you really want to set the column names on a second line, you can do this:
for (i in 1:5){
  RMSE = sqrt(mean(df[,i] ^ 2))
  row_num = i
  row = data.frame(RMSE, as.character(row_num), stringsAsFactors = FALSE)
  colnames(row) = c("RMSE", "Row Number")
  bestcombo = rbind(bestcombo, row)
}
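If you already have a data frame whose RMSE column is a factor, it can be repaired without rerunning the loop. The sketch below uses a hypothetical factor f with levels in order of appearance, as in the question's output; the key point is to go through as.character(), because as.numeric() on a factor returns the level codes.

```r
x <- c("0.547722557505166", "0.774596669241483", "0.707106781186548")
f <- factor(x, levels = unique(x))   # levels in order of appearance

as.numeric(f)                        # level codes 1, 2, 3, not the values
vals <- as.numeric(as.character(f))  # the actual numbers

order(f)      # sorts by level code, so the rows stay in their original order
order(vals)   # sorts by value
```

Applied to the question's data frame, that would be bestcombo$RMSE = as.numeric(as.character(bestcombo$RMSE)), after which order() behaves as expected.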
Just for the sake of completeness: although it wasn't the focus of your question, growing a data frame by rbinding rows one at a time like this will incur a significant speed penalty once the data frame gets reasonably large.
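One way to avoid growing the data frame is to compute all the RMSE values first and build bestcombo in a single call. This is a sketch of that idea rather than the only option; it assumes the same df layout as in the question, and set.seed() is added only to make the example reproducible.

```r
set.seed(1)
df <- as.data.frame(matrix(rbinom(10 * 5, 1, .5), nrow = 10, ncol = 5))

# One RMSE per column; vapply() guarantees a numeric vector.
rmse <- vapply(df, function(col) sqrt(mean(col^2)), numeric(1))

# Build the whole result at once instead of rbind()-ing row by row.
bestcombo <- data.frame(
  RMSE = unname(rmse),
  `Row Number` = as.character(seq_along(rmse)),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# RMSE is numeric, so order() sorts by value.
bestcombo[order(bestcombo$RMSE), ]
```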