I'm trying to replicate the procedure proposed here on my data but I get the following error:
Error in interval.numeric(x, breaks = c(xmin - tol, ux, xmax)) :
invalid number of intervals
target is the categorical variable that I want to predict, and I want to force the first split of the classification tree to be made on split.variable (also categorical). By the nature of the objects, if split.variable is 1, target can only be 1, while if it is 0, target can be either 0 or 1.
Initially I treated them as factors, but then I changed them to numeric and rounded them (as suggested in other posts on SO). Unfortunately, neither of these changes helped.
I also played with the data a bit, subsetting columns and rows, but it still doesn't work.
What am I missing?
Here is an MRE to replicate the error:
library(partykit)
tdf = structure(list(target = c(0, 0, 0, 1, 0, 0, 1, 1, 1, 1), split.variable = c(0,
0, 0, 0, 1, 0, 0, 0, 0, 0), var1 = c(2.021, 1.882, 1.633, 3.917,
2.134, 1.496, 1.048, 1.552, 1.65, 3.112), var2 = c(97.979, 98.118,
98.367, 96.083, 97.866, 98.504, 98.952, 98.448, 98.35, 96.888
), var3 = c(1, 1, 1, 0.98, 1, 1, 1, 1, 1, 1), var4 = c(1, 1,
1, 0.98, 1, 1, 1, 1, 1, 1), var5 = c(18.028, 25.207, 20.788,
28.548, 18.854, 19.984, 27.352, 24.622, 25.037, 24.067), var6 = c(0.213,
0.244, 0.289, 0.26, 0.887, 0.575, 0.097, 0.054, 0.104, 0.096),
var7 = c(63.22, 59.845, 62.45, 63.48, 52.143, 51.256, 56.296,
57.494, 59.543, 68.434), var8 = c(0.748, 0.795, 0.807, 0.793,
0.901, 0.909, 0.611, 0.61, 0.618, 0.589)), row.names = c(6L,
7L, 8L, 9L, 11L, 12L, 15L, 16L, 17L, 18L), class = "data.frame")
tr1 <- ctree(target ~ split.variable, data = tdf, maxdepth = 1)
tr2 <- ctree(target ~ split.variable + ., data = tdf, subset = predict(tr1, type = "node") == 2)
Your data set is too small to do what you want:
- tr1 does not lead to any splits but produces a tree with a single root node.
- predict(tr1, type = "node") produces a vector of 10 times 1.
- subset with predict(tr1, type = "node") == 2 is empty (all FALSE).
Additionally: I'm not sure where you found the recommendation to use numeric codings of categorical variables. But for partykit you are almost always better off coding categorical variables appropriately as factor variables.
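The checks described above can be sketched roughly as follows. This is only an illustration, not a fix for the sample size problem: it rebuilds two columns of the question's data with factor coding, inspects the node assignments from tr1, and fits the second tree only if some rows actually fall into node 2. The names nodes and in_node2 and the NULL fallback for tr2 are assumptions made for this example, not part of the original code.

```r
library(partykit)

# Two columns from the question's data, coded as factors (not numeric)
tdf <- data.frame(
  target = factor(c(0, 0, 0, 1, 0, 0, 1, 1, 1, 1)),
  split.variable = factor(c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0))
)

tr1 <- ctree(target ~ split.variable, data = tdf, maxdepth = 1)

# Node id for every row; a root-only tree assigns all rows to node 1
nodes <- predict(tr1, type = "node")
print(table(nodes))

# Fit the second tree only if some rows actually landed in node 2
# (in_node2 and the NULL fallback are hypothetical helpers for this sketch)
in_node2 <- nodes == 2
if (any(in_node2)) {
  tr2 <- ctree(target ~ ., data = tdf, subset = in_node2)
} else {
  tr2 <- NULL
  message("tr1 did not split: no rows in node 2, so tr2 is skipped")
}
```

Checking table(nodes) before using a node id in subset makes the empty-subset case visible up front instead of surfacing later as the "invalid number of intervals" error.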