I train support vector machines with the ksvm
function from the kernlab package in R, on large numbers of observations (around 300k) with few features (1-8). I want to use the resulting probability model, but for large data sets the probability model comes back in an unexpected format.
This is what should happen:
library(kernlab)
set.seed(1)  # for reproducibility
n <- 1000
df <- data.frame(label=c(rep("x",n),rep("y",n)),value=c(runif(n),runif(n)+2))
m <- ksvm(label~value,df,prob.model=TRUE)
> prob.model(m)
[[1]]
[[1]]$A
[1] -6.836228
[[1]]$B
[1] 0.003163229
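For context, my understanding is that A and B are the parameters of Platt's sigmoid, which maps SVM decision values to class probabilities. A minimal sketch of how they would be applied (the platt helper is my own name, not a kernlab function):

```r
# Platt's sigmoid: maps a decision value f to a probability
# using the fitted parameters A and B. (Helper name is mine,
# not part of kernlab.)
platt <- function(f, A, B) 1 / (1 + exp(A * f + B))

# With the values above, a decision value of 0 sits near 0.5:
round(platt(0, -6.836228, 0.003163229), 3)  # ~0.499
```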
However, for large values of n (e.g. 100k; beware of high memory usage and long execution times), the value of prob.model(m)[[1]] is a numeric vector of length 2n, seemingly a likelihood for each observation in df. What could cause this?
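To make the two shapes easy to tell apart, here is a small helper I use for checking the returned structure (check_prob_model is my own name, not part of kernlab):

```r
# Distinguish the expected Platt parameters (a list with $A and $B)
# from the degenerate numeric vector seen with large n.
check_prob_model <- function(pm) {
  if (is.list(pm) && all(c("A", "B") %in% names(pm))) "platt" else "degenerate"
}

# With kernlab loaded one would call, e.g.:
#   check_prob_model(prob.model(m)[[1]])
# For n = 1000 I get "platt"; for n = 100000 I get "degenerate".
```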
Session info:
R version 2.15.2 (2012-10-26)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] graphics grDevices datasets utils stats methods base
other attached packages:
[1] kernlab_0.9-16 e1071_1.6-1 class_7.3-5 data.table_1.8.8
loaded via a namespace (and not attached):
[1] tools_2.15.2
Edit: to be clear, this is a classification task; df
has the following form:
label value
"x" 0.21
...
"x" -1.20
"y" 2.42
...
The origin of the problem is indicated by the following error message:
line search fails
A more specific question, including the original data frame I used, is here: Line search fails in training ksvm prob.model.