Search code examples
rlistr-caretrpartconfusion-matrix

Applying a function to generate confusion matrices from nested lists of classification tree class probabilities within a list


My apologies in advance for such a long and detailed explanation of my problem. I have produced 10 nested dataframes from classification tree class probabilities (grouping factor: G8 and V4) in a master-list using three functions Shuffle100 my_List and Final_lists (below). I am sorry that I am asking this simple question but I wasn't able to figure it out. Many thanks is advance if anyone find a solution.

Aim 1

(1) I would like to insert the functionconfusionMatrix() from the caret package into the function shuffle100 to produce 10 confusion matrices for each subset

Functions shuffle100, my_list and Final_lists

library(plyr)
library(caret)
library(e1071)
library(rpart)

set.seed(1235)

 shuffle100 <-lapply(seq(10), function(n){ #Select the production of 10 dataframes
 subset <- normalised_scores[sample(nrow(normalised_scores), 80),] #Shuffle rows
 subset_idx <- sample(1:nrow(subset), replace = FALSE)
 subset <- subset[subset_idx, ] #training subset
 subset1<-subset[-subset_idx, ] #test subset
 subset_resampled_idx <- createDataPartition(subset_idx, times = 1, p = 0.7, list = FALSE) #70 % training set    
 subset_resampled <- subset[subset_resampled_idx, ]
 ct_mod<-rpart(Matriline~., data=subset_resampled, method="class", control=rpart.control(cp=0.005)) #10 ct
 ct_pred<-predict(ct_mod, newdata=subset[, 2:13]) 
 ct_dataframe=as.data.frame(ct_pred)#create new data frame
 confusionMatrix(ct_dataframe, normalised_scores$Family)
 }

  Error in sort.list(y) : 'x' must be atomic for 'sort.list'
  Have you called 'sort' on a list?

 1: lapply(seq(10), function(n) {
subset <- normalised_scores[sample(nrow(normalised_scores
 2: FUN(X[[i]], ...)
 3: confusionMatrix(ct_dataframe, normalised_scores$Family)
 4: confusionMatrix.default(ct_dataframe, normalised_scores$Family)
 5: factor(data)
 6: sort.list(y)

 #Produce three columns: Predicted, Actual and Binary
 my_list <- lapply(shuffle100, function(df){#Create two new columns Predicted and Actual
                  if (nrow(df) > 0)
                cbind(df, Predicted = c(""), Actual = c(""), Binary = c(""))
         else
                 bind(df, Predicted = character(), Actual = c(""), Binary = c (""))
                 })

#Fill the empty columns with NA's
Final_lists <- lapply(my_list, function(x) mutate(x, Predicted = NA, Actual = NA, Binary = NA)) 

#Create a dataframe from the column normalised_scores$Family to fill the Actual column

Actual_scores<-Final_normalised3$Family
Final_scores<-as.data.frame(Actual_scores)

#Fill in the Predicted, Actual and Binary columns

 Predicted_Lists <- Final_lists %>%
 mutate(Predicted=ifelse(G8 > V4, G8, V4)) %>% # assuming if G8 > V4 then Predicted=G8
 mutate(Actual=Final_scores) %>% # your definition of Actual is not clear
 mutate(Binary=ifelse(Predicted==Actual, 1, 0))

#Error messages

Error in ifelse(G8 > V4, G8, V4) : object 'G8' not found

Aim 2

To write a function or for loop to fill the Predicted, Actual and Binary columns for each subset based on the condition that probabilities in rows for columns V4 or G8 may be greater than or lesser than each other. However, I am confused with the correct syntax for functions and loops

A for loop which does not work

  for(i in 1:length(Final_lists)){ #i loops through each dataframe in the list 
   for(j in 2:nrow(Final_lists[[i]])){ #j loops through each row of each dataframe in the list
   if(Final_lists[[i]][j, "G8"] > Final_lists[[i]][j, "V4"]) { #if the probability of G8 > V4 in each row of each dataframe in each list
      Final_lists[[i]][j, [j["Predicted" == "NA"]] ="G8" #G8 will be filled into the same row in the `Predicted' column
      }
    else {
   Final_lists[[i]][j, [Predicted == "NA"]] ="V4" #V4 will be filled into the same row in the `Predicted' column
    }
print(i)
    }
    }

Each subset should have this format when columns are filled:

               G8        V4 Predicted Actual Binary
        0.1764706 0.8235294        V4     V4      1
        0.7692308 0.2307692        G8     V4      0
        0.7692308 0.2307692        G8     V4      0
        0.7692308 0.2307692        G8     V4      0
        0.7692308 0.2307692        G8     V4      0
        0.1764706 0.8235294        V4     V4      1

Filling the Predicted column

If the probability of G8 > V4, then the empty Predicted row is assigned G8. However, if V4 > G8, then the empty `Predicted' row will be assigned V4.

Filling the Actual column

These are the actual predicted class probability predictions from the classification tree models for each subset, which are contained in the data_frame `normalised_scores

Filling in the Binary column

If the Predicted and Actual rows have the same result (e.g. G8 and G8), then the Binary row is assigned the value of 1. However, if rows of the Predicted and Actual columns are different (e.g. G8 and V4), then the Binary row is assigned the value 0.

I achieved these aims using this working code, however, I am unsure how to apply this code to the subsets in the master-list.

Working code for a single subset

      set.seed(1235)

    # Randomly permute the data before subsetting
      mydat_idx <- sample(1:nrow(Final_normalised_scores), replace = FALSE)
      mydat <- Final_normalised3[mydat_idx, ]

      mydat_resampled_idx <- createDataPartition(mydat_idx, times = 1, p = 0.7, list = FALSE)
      mydat_resampled <- mydat[mydat_resampled_idx, ] # Training portion of the data
      mydat_resampled1 <- mydat[-mydat_resampled_idx, ]

      #Classification tree

      ct_mod <- train(x = mydat_resampled[, 2:13], y = as.factor(mydat_resampled[, 1]), 
            method = "rpart", trControl = trainControl(method = "repeatedcv", number=10, repeats=100, classProbs = TRUE))

       #Model predictions
       ct_pred <- predict(ct_mod, newdata = mydat[ , 2:13], type = "prob")
       Final_Predicted<-as.data.frame(ct_pred)

       #Produce three empty columns: Predicted, Actual and Binary

       Final_Predicted$Predicted<-NA
       Final_Predicted$Actual<-NA
       Final_Predicted$Binary<-NA

       #Fill in the Predicted column

      for (i in 1:length(Final_Predicted$G8)){
        if(Final_Predicted$G8[i]>Final_Predicted$V4[i]) {
           Final_Predicted$Predicted[i]<-"G8"
           }
      else {
           Final_Predicted$Predicted[i]<-"V4"
           }
           print(i)
           }

        #Fill in the Actual column using the actual predictions from the dataframe normalised_scores

        Final_Predicted$Actual<-normalised_scores$Family

        #Fill in the Binary column

        for (i in 1:length(Final_Predicted$Binary)){
           if(Final_Predicted$Predicted[i]==Final_Predicted$Actual[i]) {
              Final_Predicted$Binary[i]<-1
              }
         else {
              Final_Predicted$Binary[i]<-0
              }
              print(i)
              }

Subset from the master-list

                  G8        V4 Predicted Actual Binary
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA

Reproducible dummy data

SummarySE (Rmisc package) to produce a barplot with error bars (ggplot2)


Solution

  • Your description of the problem is a bit long, but a possible dplyr solution would look like this:

    Final_Predicted$Actual <- ... # fill actual values
    Final_Predicted <- Final_Predicted %>%
                  mutate(Predicted=ifelse(G8 > V4, G8, V4)) %>% # assuming if G8==V4 then Predicted=V4
                  mutate(Binary=ifelse(Predicted==Actual, 1, 0))
    

    I didn't actually run this solution, but it should be something short and simple along these lines. Hope this helps.