I'm using the mlr package's framework to build an SVM model to predict land-cover classes in an image. I tried two approaches: the raster package's predict function, and converting the raster to a data frame and predicting on that data frame with the "learner.model" as the model. Both gave me realistic results.
These work well:
> predict(raster, mod$learner.model)
or
> xy <- as.data.frame(raster, xy = T)
> C <- predict(mod$learner.model, xy)
However, if I predict on the dataframe derived from the raster without specifying the learner.model, the results are not the same.
> C2 <- predict(mod, newdata=xy)
C2$data$response is not identical to C. Why?
Here is a reproducible example that demonstrates the problem:
> library(mlr)
> library(kernlab)
> x1 <- rnorm(50)
> x2 <- rnorm(50, 3)
> x3 <- rnorm(50, -20, 3)
> C <- sample(c("a","b","c"), 50, T)
> d <- data.frame(x1, x2, x3, C)
> classif <- makeClassifTask(id = "example", data = d, target = "C")
> lrn <- makeLearner("classif.ksvm", predict.type = "prob", fix.factors.prediction = T)
> t <- train(lrn, classif)
Using automatic sigma estimation (sigest) for RBF or laplace kernel
> res1 <- predict(t, newdata = data.frame(x2,x1,x3))
> res1
Prediction: 50 observations
predict.type: prob
threshold: a=0.33,b=0.33,c=0.33
time: 0.01
prob.a prob.b prob.c response
1 0.2110131 0.3817773 0.4072095 c
2 0.1551583 0.4066868 0.4381549 c
3 0.4305353 0.3092737 0.2601910 a
4 0.2160050 0.4142465 0.3697485 b
5 0.1852491 0.3789849 0.4357659 c
6 0.5879579 0.2269832 0.1850589 a
> res2 <- predict(t$learner.model, data.frame(x2,x1,x3))
> res2
[1] c c a b c a b a c c b c b a c b c a a b c b c c a b b b a a b a c b a c c c
[39] c a a b c b b b b a b b
Levels: a b c
> res1$data$response == res2
[1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[25] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[49] TRUE FALSE
The predictions are not identical. Following mlr's tutorial page on prediction, I don't see why the results would differ. Thanks for your help.
Update: When I do the same with a random forest model, the two vectors are equal. Is this because SVM is scale dependent and random forest is not?
> library(randomForest)
> classif <- makeClassifTask(id = "example", data = d, target = "C")
> lrn <- makeLearner("classif.randomForest", predict.type = "prob", fix.factors.prediction = T)
> t <- train(lrn, classif)
>
> res1 <- predict(t, newdata = data.frame(x2,x1,x3))
> res1
Prediction: 50 observations
predict.type: prob
threshold: a=0.33,b=0.33,c=0.33
time: 0.00
prob.a prob.b prob.c response
1 0.654 0.228 0.118 a
2 0.742 0.090 0.168 a
3 0.152 0.094 0.754 c
4 0.092 0.832 0.076 b
5 0.748 0.100 0.152 a
6 0.680 0.098 0.222 a
>
> res2 <- predict(t$learner.model, data.frame(x2,x1,x3))
> res2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
a a c b a a a c a b b b b c c a b b a c b a c c b c
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
a a b a c c c b c b c a b c c b c b c a c c b b
Levels: a b c
>
> res1$data$response == res2
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[46] TRUE TRUE TRUE TRUE TRUE
Another Update: If I change predict.type from "prob" to "response", the two SVM prediction vectors agree with each other. I'm going to look into how these two types differ. I had thought that "prob" gave the same class predictions and simply added the probabilities. Maybe this isn't the case?
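A minimal sketch of that check, assuming the same kind of simulated data as above (the fixed seed and the object names lrn_resp, mod_resp, resp_mlr and resp_raw are only illustrative additions, not part of the original session):

```r
library(mlr)      # classif.ksvm also needs the kernlab package installed

set.seed(1)       # assumption: fixed seed so the sketch is repeatable
d <- data.frame(
  x1 = rnorm(50),
  x2 = rnorm(50, 3),
  x3 = rnorm(50, -20, 3),
  C  = factor(sample(c("a", "b", "c"), 50, replace = TRUE))
)
task    <- makeClassifTask(id = "example", data = d, target = "C")
newdata <- d[, c("x2", "x1", "x3")]

# Learner that returns class labels only (no probability model)
lrn_resp <- makeLearner("classif.ksvm", predict.type = "response")
mod_resp <- train(lrn_resp, task)

# Prediction through mlr vs. prediction on the underlying kernlab model
resp_mlr <- getPredictionResponse(predict(mod_resp, newdata = newdata))
resp_raw <- predict(mod_resp$learner.model, newdata)

same <- all(as.character(resp_mlr) == as.character(resp_raw))
print(same)
```

With predict.type = "response", both calls should go through the same kind of kernlab prediction, which is consistent with the agreement I saw.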
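The answer lies here: "Why are probabilities and response in ksvm in R not consistent?"
In short, ksvm with type = "probabilities" gives different results than with type = "response". As far as I understand it, the probabilities come from an additional model that kernlab fits on the SVM outputs when prob.model = TRUE, so the most probable class does not have to match the class from the SVM decision itself. This is not a matter of scale dependence.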
If I run
> res2 <- predict(t$learner.model, data.frame(x2,x1,x3), type = "probabilities")
> res2
then I get the same result as res1 above (type = "response" is the default, which is why my earlier call on the learner.model differed).
Unfortunately, it seems that classifying an image based on the probabilities doesn't do as well as using the "response". Maybe the probabilities are still the best way to estimate the certainty of a classification?
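One way to combine the two, sketched here directly with kernlab under my own assumptions (simulated data, a fixed seed, and hypothetical column names label, prob_label, certainty and agree), is to keep the "response" class as the map label and attach the highest class probability as a rough certainty value:

```r
library(kernlab)

set.seed(1)       # assumption: fixed seed so the sketch is repeatable
d <- data.frame(
  x1 = rnorm(50),
  x2 = rnorm(50, 3),
  x3 = rnorm(50, -20, 3),
  C  = factor(sample(c("a", "b", "c"), 50, replace = TRUE))
)
features <- d[, c("x1", "x2", "x3")]

# prob.model = TRUE fits the additional probability model on top of the SVM
fit <- ksvm(C ~ ., data = d, prob.model = TRUE)

label <- predict(fit, features, type = "response")       # SVM decision
prob  <- predict(fit, features, type = "probabilities")  # one column per class

# Class with the highest probability, and that probability as a certainty score
prob_label <- colnames(prob)[max.col(prob, ties.method = "first")]
certainty  <- apply(prob, 1, max)

out <- data.frame(
  label      = label,
  prob_label = prob_label,
  certainty  = certainty,
  agree      = as.character(label) == prob_label
)
print(head(out))
print(mean(out$agree))  # share of rows where the two labelings agree
```

Rows where agree is FALSE are the ones where the two prediction types disagree. Whether the maximum probability is a good certainty measure for a given image is something I have not verified.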