R - conditional replacement of missing factors in confusion matrix


I'm creating confusion matrices for a large dataset of sample points and need to loop through them with the confusionMatrix function from the R package caret (I'm using the accuracy metrics from the output, i.e. I can't just use table). There should be three classes/factors for each set of sample points, i.e. I should have 3x3 tables. However, some of the reference and predicted data contain fewer than three classes, or non-overlapping classes, for example:

 Class  A  B             Class  C
   A    8  2               A    3
   B    1  0
   C    1  7
* columns = reference data, rows = predicted data

I need the same number of classes/factors to run confusionMatrix, so what I want to do is conditionally replace the missing factor(s) with zeros, like so:

 Class  A  B  C          Class  A  B  C
   A    8  2  0            A    0  0  3
   B    1  0  0            B    0  0  0
   C    1  7  0            C    0  0  0

The predicted/reference data I'm using are numeric lists of values, so I won't reproduce them here; for the example I've provided above you can think of it as just a vector like:

predicted.data[1] = A A A A A A A A A A B C C C C C C C C
reference.data[1] = A A A A A A A A A A B B B B B B B B B
predicted.data[2] = A A A
reference.data[2] = C C C 

I tried to create some sort of conditional if statement along the lines of:

   tab <- table(predicted.data, reference.data)
   if(nrow(tab) != ncol(tab)){
   classes <- c("A","B","C")
   missing <- setdiff(classes,names(tab))
   ...
   ...
   }

# would put in a loop/index actual data obviously 

But I can't seem to get it to work the way I want. Any thoughts?

Edit: example of actual data I'm using (via rasters/shapefiles) and the error message; data have same length but no reference data was classified as a '2':

> mask.vals[[4]]
  [1] 0 4 0 0 0 2 4 0 4 0 4 0 0 0 0 0 4 0 4 2 0 0 0 0 0 0 0 4 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 2 2 0 2 0 4 0 0 4 2 0 0 4 0 0 0 0 0 0 0 2 0 2 0 2 4 0 4
 [72] 4 0 0 0 0 4 4 0 0 0 0 0 0 0 4 0 0 0 0 4 4 4 4 0 4 4 4 4 4 0 4 4 4 0 4 0 0 4 4 4 4 4 4

> ref.data[[4]]@data$CLASS_ID
  [1] 0 4 4 4 4 4 4 4 4 4 4 4 4 4 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [72] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
4 4 4 4 4 4 4

> confusionMatrix(data = mask.vals[[4]], reference = ref.data[[4]]@data$CLASS_ID)

Error in confusionMatrix.default(data = mask.vals[[4]], reference = ref.data[[4]]@data$CLASS_ID) : 
  the data cannot have more levels than the reference

i.e. need to go from this:

> table(mask.vals[[4]], ref.data[[4]]@data$CLASS_ID)
        0  4          
   0    2  67              
   2    0  9
   4    0  36

to this:

        0  2  4          
   0    2  0  67              
   2    0  0  9
   4    0  0  36

This error persists even when I define the three levels for the data, e.g. with levels(ref.data[[4]]@data$CLASS_ID) <- c("0","2","4") or factor(ref.data[[4]]@data$CLASS_ID, levels = c("0","2","4")) ...


Solution

  • When both vectors are factors with the same set of levels, caret's confusionMatrix function returns an n x n table even if some levels never occur in the reference and/or prediction vectors, so I'm wondering how you ended up with a confusion matrix that is missing some of the reference columns. For example, using the built-in iris data frame:

    library(caret)
    
    set.seed(2)
    dat = data.frame(ref=iris$Species, pred=sample(iris$Species))
    
    # Remove two levels from the reference data
    dat1 = dat[dat$ref=="setosa", ]
    
    # Get the confusion matrix
    cm1 = confusionMatrix(dat1$pred, dat1$ref)
    
    cm1$table
    
                Reference
    Prediction   setosa versicolor virginica
      setosa         15          0         0
      versicolor     15          0         0
      virginica      20          0         0
    
    # No overlap between reference and prediction
    dat2 = dat[dat$ref=="setosa" & dat$pred=="versicolor", ]
    
    # Get the confusion matrix
    cm2 = confusionMatrix(dat2$pred, dat2$ref)
    
    cm2$table
    
                Reference
    Prediction   setosa versicolor virginica
      setosa          0          0         0
      versicolor     15          0         0
      virginica       0          0         0
    

    In the above examples, the ref and pred columns are both coded as factors with the original three levels of Species. We could recode them to drop the empty levels:

    dat2$ref = droplevels(dat2$ref)
    dat2$pred = droplevels(dat2$pred)
    

    And you can see that only one factor level is now present in each column:

    lapply(dat2, levels)
    
    $ref
    [1] "setosa"
    
    $pred
    [1] "versicolor"
    

    But if you run confusionMatrix it now throws an error because there's no overlap between the levels of the two vectors:

    cm3 = confusionMatrix(dat2$pred, dat2$ref)
    

    Error in confusionMatrix.default(dat2$pred, dat2$ref) : The data must contain some levels that overlap the reference.
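
One way to get back to a common level set after levels have been dropped is to re-create both factors on the union of their levels. The sketch below uses only base R; the vectors ref and pred are made-up stand-ins for dat2$ref and dat2$pred after droplevels(), and the names all_levels, ref2 and pred2 are illustrative only.

```r
# Made-up stand-ins for dat2$ref and dat2$pred after droplevels()
ref  <- factor(rep("setosa", 15))
pred <- factor(rep("versicolor", 15))

# Union of the levels seen in either vector
all_levels <- union(levels(ref), levels(pred))

# Re-create both factors on the shared level set
ref2  <- factor(ref,  levels = all_levels)
pred2 <- factor(pred, levels = all_levels)

# The cross-tabulation is now square, with zero-filled cells
tab <- table(Prediction = pred2, Reference = ref2)
print(tab)
```

Note that this only restores levels that occur in at least one of the two vectors. If a class is absent from both, it has to be supplied explicitly in the levels argument.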

    UPDATE: If you set the same factor levels in the reference vector and prediction vector, confusionMatrix will work. You've updated the question, but it's still not reproducible, so it's difficult to determine where things are going wrong in your workflow. For now, here's an example that's similar to what you've shown in your question and that works as expected after setting common factor levels.

    library(caret)
    
    set.seed(2)
    mask.vals = sample(c(0,2,4), 100, replace=TRUE)
    ref.data = rep(4,100)
    
    cm = confusionMatrix(mask.vals, ref.data)
    
    Error in confusionMatrix.default(mask.vals, ref.data) : 
      the data cannot have more levels than the reference
    
    mask.vals = factor(mask.vals, levels=c(0,2,4))
    ref.data = factor(ref.data, levels=c(0,2,4))
    
    cm = confusionMatrix(mask.vals, ref.data) 
    
    cm$table
    
              Reference
    Prediction  0  2  4
             0  0  0 35
             2  0  0 31
             4  0  0 34
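
For the looping case in the question, the same idea can be wrapped in a small helper that always cross-tabulates against a fixed set of classes. This is a minimal sketch in base R, not a tested solution for the original raster/shapefile data: padded_table, pred_list and ref_list are hypothetical names, and the values are made up. One possible reason the factor() attempt in the question had no effect is that factor() returns a new object, so its result has to be assigned (or used directly, as here) rather than called on its own.

```r
# Hypothetical helper: cross-tabulate against a fixed set of classes,
# so classes missing from either vector appear as zero rows/columns
padded_table <- function(pred, ref, classes) {
  table(Prediction = factor(pred, levels = classes),
        Reference  = factor(ref,  levels = classes))
}

classes <- c(0, 2, 4)

# Made-up lists standing in for mask.vals and the CLASS_ID vectors
pred_list <- list(c(0, 2, 4, 4, 0), c(2, 2, 2))
ref_list  <- list(c(4, 4, 4, 4, 0), c(0, 0, 0))

# One 3x3 table per set of sample points
tabs <- lapply(seq_along(pred_list), function(i) {
  padded_table(pred_list[[i]], ref_list[[i]], classes)
})

print(tabs[[1]])

# Assumption: with caret installed, each table (or the two re-leveled
# factors) could then be handed to confusionMatrix, e.g.
# cms <- lapply(tabs, caret::confusionMatrix)
```

Building the factors inside the helper keeps the level set in one place, so every iteration of the loop produces a table with the same dimensions.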