I'm creating confusion matrices for a large dataset of sample points and need to loop through them with the confusionMatrix function from the R package caret (I'm using the accuracy metrics from the output, i.e. I can't just use table). There should be three classes/factors for each set of sample points, i.e. I should have 3x3 tables. However, some of the reference and predicted data contain fewer than three classes, or non-overlapping classes, e.g.:
Class  A  B        Class  C
A      8  2        A      3
B      1  0
C      1  7
* columns = reference data, rows = predicted data
I need the same number of classes/factors to run confusionMatrix, so what I want to do is conditionally replace the missing factor(s) with zeros, like so:
Class  A  B  C        Class  A  B  C
A      8  2  0        A      0  0  3
B      1  0  0        B      0  0  0
C      1  7  0        C      0  0  0
The predicted/reference data I'm using are numeric lists of values, so I won't reproduce them here; for the example I've provided above you can think of it as just a vector like:
predicted.data[1] = A A A A A A A A A A B C C C C C C C C
reference.data[1] = A A A A A A A A A A B B B B B B B B B
predicted.data[2] = A A A
reference.data[2] = C C C
I tried to create some sort of conditional if statement along the lines of:
tab <- table(predicted.data, reference.data)
if (nrow(tab) != ncol(tab)) {
  classes <- c("A","B","C")
  # reference classes are the column names of the table
  missing <- setdiff(classes, colnames(tab))
  ...
  ...
}
# would put in a loop/index actual data obviously
But I can't seem to get it to work the way I want. Any thoughts?
Edit: example of actual data I'm using (via rasters/shapefiles) and the error message; data have same length but no reference data was classified as a '2':
> mask.vals[[4]]
[1] 0 4 0 0 0 2 4 0 4 0 4 0 0 0 0 0 4 0 4 2 0 0 0 0 0 0 0 4 0 0 0 0 0 0 4 0
0 0 0 0 0 0 0 2 2 0 2 0 4 0 0 4 2 0 0 4 0 0 0 0 0 0 0 2 0 2 0 2 4 0 4
[72] 4 0 0 0 0 4 4 0 0 0 0 0 0 0 4 0 0 0 0 4 4 4 4 0 4 4 4 4 4 0 4 4 4 0 4 0
0 4 4 4 4 4 4
> ref.data[[4]]@data$CLASS_ID
[1] 0 4 4 4 4 4 4 4 4 4 4 4 4 4 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[72] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4
> confusionMatrix(data = mask.vals[[4]], reference = ref.data[[4]]@data$CLASS_ID)
Error in confusionMatrix.default(data = mask.vals[[4]], reference = ref.data[[4]]@data$CLASS_ID) :
the data cannot have more levels than the reference
i.e. need to go from this:
> table(mask.vals[[4]], ref.data[[4]]@data$CLASS_ID)
     0  4
  0  2 67
  2  0  9
  4  0 36
to this:
     0  2  4
  0  2  0 67
  2  0  0  9
  4  0  0 36
This error persists even when I define the three levels for the data (e.g. levels(ref.data[[4]]@data$CLASS_ID) <- c("0","2","4") or factor(ref.data[[4]]@data$CLASS_ID, levels = c("0","2","4"))) ...
The caret confusionMatrix function returns an n x n table regardless of whether some levels are absent from the reference and/or prediction vectors. I'm wondering how you managed to get a confusion matrix with some of the reference data columns missing. For example, using the built-in iris data frame:
library(caret)
set.seed(2)
dat = data.frame(ref=iris$Species, pred=sample(iris$Species))
# Remove two levels from the reference data
dat1 = dat[dat$ref=="setosa", ]
# Get the confusion matrix
cm1 = confusionMatrix(dat1$pred, dat1$ref)
cm1$table
            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor     15          0         0
  virginica      20          0         0
# No overlap between reference and prediction
dat2 = dat[dat$ref=="setosa" & dat$pred=="versicolor", ]
# Get the confusion matrix
cm2 = confusionMatrix(dat2$pred, dat2$ref)
cm2$table
            Reference
Prediction   setosa versicolor virginica
  setosa          0          0         0
  versicolor     15          0         0
  virginica       0          0         0
In the above examples, the ref and pred columns are both coded as factors with the original three levels of Species. We could recode them to drop the empty levels:
dat2$ref = droplevels(dat2$ref)
dat2$pred = droplevels(dat2$pred)
And you can see that only one factor level is present in each column:
lapply(dat2, levels)
$ref
[1] "setosa"

$pred
[1] "versicolor"
But if you run confusionMatrix, it now throws an error because there's no overlap between the levels of the two vectors:
cm3 = confusionMatrix(dat2$pred, dat2$ref)
Error in confusionMatrix.default(dat2$pred, dat2$ref) : The data must contain some levels that overlap the reference.
UPDATE: If you set the same factor levels in the reference vector and prediction vector, confusionMatrix will work. You've updated the question, but it's still not reproducible, so it's difficult to determine where things are going wrong in your workflow. For now, here's an example that's similar to what you've shown in your question and that works as expected after setting common factor levels.
library(caret)
set.seed(2)
mask.vals = sample(c(0,2,4), 100, replace=TRUE)
ref.data = rep(4,100)
cm = confusionMatrix(mask.vals, ref.data)
Error in confusionMatrix.default(mask.vals, ref.data) : the data cannot have more levels than the reference
mask.vals = factor(mask.vals, levels=c(0,2,4))
ref.data = factor(ref.data, levels=c(0,2,4))
cm = confusionMatrix(mask.vals, ref.data)
cm$table
          Reference
Prediction  0  2  4
         0  0  0 35
         2  0  0 31
         4  0  0 34
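Since the question involves looping over many sets of sample points, the same idea can be sketched for a list of vectors. This is only an illustration under assumptions: pred_list and ref_list are made-up stand-ins for the real data, classes is assumed to be the full set of class codes, and base R table is used so the snippet runs without caret (with caret loaded, confusionMatrix could be called on the same pair of factors). Note that factor(x, levels = ...) returns a new object, so its result has to be assigned back; calling it without assignment leaves x unchanged, which may be why the error persisted in the question.

```r
# Assumed full set of class codes (taken from the example in the question)
classes <- c(0, 2, 4)

# Made-up stand-ins for the real prediction/reference vectors
pred_list <- list(c(0, 2, 4, 4, 0), c(0, 0, 0))
ref_list  <- list(c(4, 4, 4, 4, 0), c(4, 4, 4))

# Build one square table per pair, using the same levels on both sides
tabs <- Map(function(pred, ref) {
  pred <- factor(pred, levels = classes)
  ref  <- factor(ref,  levels = classes)
  # with caret loaded, confusionMatrix(data = pred, reference = ref) could be used here
  table(Prediction = pred, Reference = ref)
}, pred_list, ref_list)

# Dimensions of each table; absent classes show up as all-zero rows/columns
sapply(tabs, dim)
```

Because both factors carry the same three levels, classes that never occur are kept as zero-filled rows or columns rather than being dropped, which is the shape asked for in the question.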