r, indexing, escaping, paste, double-quotes

R: using escaping in paste to build a vector of character strings that call data from a matrix through indexing


I have some code that takes the error-rate information from a random forest model (WSAA_model1) and builds a data frame. I then plot the values to see whether the model is stable at a given number of trees. The random forest model predicts a categorical outcome, and its categories are factor levels stored as character strings that happen to look like numbers, so "12" is a category.

      oob.error.data <- data.frame(
      Trees = rep(1:nrow(WSAA_model1$err.rate), times = 3),
      Type = rep(c("OOB", "1", "3", "4", "5", "6", "7", "10", "11", "12", "13", "14",
                   "20", "21", "22", "23", "24", "25", "26", "27", "28"), 
                   each = nrow(WSAA_model1$err.rate)),
      Error = c(WSAA_model1$err.rate[,"OOB"], 
        WSAA_model1$err.rate[,"1"], 
        WSAA_model1$err.rate[,"3"],
        WSAA_model1$err.rate[,"4"],
        WSAA_model1$err.rate[,"5"],
        WSAA_model1$err.rate[,"6"],
        WSAA_model1$err.rate[,"7"],
        WSAA_model1$err.rate[,"10"],
        WSAA_model1$err.rate[,"11"],
        WSAA_model1$err.rate[,"12"],
        WSAA_model1$err.rate[,"13"],
        WSAA_model1$err.rate[,"14"],
        WSAA_model1$err.rate[,"20"],
        WSAA_model1$err.rate[,"21"],
        WSAA_model1$err.rate[,"22"],
        WSAA_model1$err.rate[,"23"],
        WSAA_model1$err.rate[,"24"],
        WSAA_model1$err.rate[,"25"],
        WSAA_model1$err.rate[,"26"],
        WSAA_model1$err.rate[,"27"],
        WSAA_model1$err.rate[,"28"]))
    
    ggplot(data = oob.error.data, aes(x = Trees, y = Error)) +
      geom_line(aes(color = Type))

This code works as I expect, and I can use it to build a nice graph with ggplot.

I want to be able to apply this code to other random forest models. These other models may not have the same number of factor levels in the predicted outcome (the numbers stored as characters in the code above). So I want the code to pull the necessary values from the model itself. WSM1_model1 is the next in the series of models, and I have been trying variations on the code below.

    biolev <- c("OOB", levels(WSM1_model1$y))
    errlev <- c()
    for (i in 1:length(biolev)) {
      errlev <- c(errlev, paste0("WSM1_model1$err.rate[,", '"', biolev[i], '"', "]"))
    }
    oob.error.data <- data.frame(
      Trees = rep(1:nrow(WSM1_model1$err.rate), times = 3),
      Type = rep(biolev, each = nrow(WSM1_model1$err.rate)),
      Error = c(errlev))

    ggplot(data = oob.error.data, aes(x = Trees, y = Error)) +
      geom_line(aes(color = Type))

biolev is a character vector, as I expected:

     [1] "OOB" "1"   "3"   "4"   "5"   "6"   "7"   "10"  "11"  "12"  "13"  "14"  "20" "21"  "23"  "27"

I have tried various versions of the for loop to get the quotes around the numbers in biolev. Printing errlev gives:

    errlev

     [1] "WSM1_model1$err.rate[,\"27\"]"  "WSM1_model1$err.rate[,\"OOB\"]"
     [3] "WSM1_model1$err.rate[,\"1\"]"   "WSM1_model1$err.rate[,\"3\"]"
     [5] "WSM1_model1$err.rate[,\"4\"]"   "WSM1_model1$err.rate[,\"5\"]"
     [7] "WSM1_model1$err.rate[,\"6\"]"   "WSM1_model1$err.rate[,\"7\"]"
     [9] "WSM1_model1$err.rate[,\"10\"]"  "WSM1_model1$err.rate[,\"11\"]"
    [11] "WSM1_model1$err.rate[,\"12\"]"  "WSM1_model1$err.rate[,\"13\"]"
    [13] "WSM1_model1$err.rate[,\"14\"]"  "WSM1_model1$err.rate[,\"20\"]"
    [15] "WSM1_model1$err.rate[,\"21\"]"  "WSM1_model1$err.rate[,\"23\"]"
    [17] "WSM1_model1$err.rate[,\"27\"]"
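
The backslashes in that output are only how print() displays a double quote inside a quoted string; they are not stored in the string itself. A minimal sketch, using "27" purely as an example level, of the difference between the printed form and the actual characters:

```r
# Build one string the same way as in the loop above ("27" is just an example level)
s <- paste0("WSM1_model1$err.rate[,", '"', "27", '"', "]")

print(s)       # displays the inner quotes escaped with a backslash
writeLines(s)  # displays the raw characters, with plain double quotes

# The string itself contains no backslash characters
has_backslash <- grepl("\\", s, fixed = TRUE)

# It is still text, not the numeric column that the text names
is_text <- is.character(s)
```

So the quoting is probably already correct; the remaining issue is that the result is a character string rather than evaluated code.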

If I then run the code to generate the data frame, I receive the error:

    Error in data.frame(Trees = rep(1:nrow(WSM1_model1$err.rate), times = 3),  : 
      arguments imply differing number of rows: 1500, 8000, 16

Although I suspect I also have an issue with 'Type' not being a multiple of 'Trees', it is the 'Error =' I am asking about here.
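
The three numbers in the error message appear to be the lengths of the three columns passed to data.frame(). A rough sketch of that arithmetic, assuming 500 trees and the 16 entries of biolev shown above; toy_err is a hypothetical stand-in for WSM1_model1$err.rate, not the real model output:

```r
# Hypothetical stand-in for WSM1_model1$err.rate: 500 trees, 16 named columns
n_trees <- 500
biolev  <- c("OOB", "1", "3", "4", "5", "6", "7", "10", "11", "12",
             "13", "14", "20", "21", "23", "27")
toy_err <- matrix(0, nrow = n_trees, ncol = length(biolev),
                  dimnames = list(NULL, biolev))

# Strings that look like code, one per level (paste0 is vectorised, so no loop)
errlev <- paste0('toy_err[,"', biolev, '"]')

len_trees <- length(rep(1:nrow(toy_err), times = 3))     # 3 * 500
len_type  <- length(rep(biolev, each = nrow(toy_err)))   # 16 * 500
len_error <- length(errlev)                              # one string per level

c(len_trees, len_type, len_error)
```

Under these assumptions, Error holds one piece of text per level rather than 500 error rates per level, which is why its length is 16.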

I have tried different methods for building the character strings including from this question


Solution

  • As I finished writing my question I was able to get some local help. I thought I might as well share our resolution here given I had already typed up my question.

    Instead of trying to build the text for the Error column, as I had in my first example, I was able to extract the data more directly. Here the for loop pulls the values out of the matrix itself, rather than building text to be used later inside data.frame(). Obviously this could be tidied up further, but I think that, as it is, it shows the change I made more clearly.

    biolev <- c("OOB", levels(WSM1_model1$y))
    errlev <- c()
    for (i in 1:length(biolev)) {
      errlev <- c(errlev, WSM1_model1$err.rate[,biolev[i]])
    }
    
    oob.error.data <- data.frame(
      Trees = rep(1:nrow(WSM1_model1$err.rate), times = 1 + (length(levels(WSM1_model1$y)))),
      Type = rep(biolev, each = nrow(WSM1_model1$err.rate)),
      Error = errlev)
    
    ggplot(data = oob.error.data, aes(x = Trees, y = Error)) +
      geom_line(aes(color = Type))
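
One possible tidy-up of the loop above, offered only as a sketch. It relies on R storing matrices column by column, so as.vector() on the selected columns gives the same concatenation as the loop. toy_err is a hypothetical stand-in for WSM1_model1$err.rate with made-up values:

```r
# Hypothetical stand-in for a model's err.rate matrix: 4 trees, 3 columns
toy_err <- matrix(c(0.30, 0.25, 0.22, 0.20,   # OOB
                    0.10, 0.09, 0.08, 0.08,   # class "1"
                    0.50, 0.45, 0.40, 0.38),  # class "3"
                  nrow = 4,
                  dimnames = list(NULL, c("OOB", "1", "3")))
biolev <- colnames(toy_err)

# Loop version, as in the answer above
errlev <- c()
for (i in seq_along(biolev)) {
  errlev <- c(errlev, toy_err[, biolev[i]])
}

# Vectorised version: matrices are stored column-major, so this
# concatenates the columns in the order given by biolev
errvec <- as.vector(toy_err[, biolev])

oob.error.data <- data.frame(
  Trees = rep(seq_len(nrow(toy_err)), times = length(biolev)),
  Type  = rep(biolev, each = nrow(toy_err)),
  Error = errvec)
```

Using length(biolev) for times keeps Trees, Type and Error the same length whatever the number of levels in the model.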
    

    This doesn't solve the original question as asked, though. I am still curious whether I could have built a vector of character strings that would index the data as I intended, or whether the problem simply needed to be approached from a different angle.
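
On that remaining question: base R can turn text into code with parse(text = ...) and run it with eval(), so the string approach can be made to work, although direct indexing is usually simpler and easier to debug. A minimal sketch with a hypothetical toy_err matrix standing in for the model's err.rate:

```r
# Hypothetical stand-in for a model's err.rate matrix: 2 trees, 3 columns
toy_err <- matrix(1:6 / 10, nrow = 2,
                  dimnames = list(NULL, c("OOB", "1", "3")))
biolev <- colnames(toy_err)

# Strings that spell out the indexing expressions
errlev <- paste0('toy_err[,"', biolev, '"]')

# Parse each string into an expression, evaluate it, then concatenate the results
from_text <- unlist(lapply(errlev, function(s) eval(parse(text = s))))

# Direct indexing gives the same numbers without building any code as text
direct <- as.vector(toy_err[, biolev])

all.equal(from_text, direct)
```

In the original attempt the strings were passed to data.frame() as they were, so they stayed text; they would have needed an evaluation step like the one sketched here.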