I have noticed strange behavior in the lmtree function from the partykit package when it is used with factors. If some factor levels do not occur in the dataset (here "c" and "e"), the predictions change randomly from one call to the next.
I guess this means that lmtree builds the model using only the levels present in the dataset ("a" and "b" in this example), while the predict function takes all levels into account ("a", "b", "c", "e").
So how can I safely use factors with lmtree models?
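library(partykit)
df <- data.frame(x = runif(100), y = runif(100), v = sample(c("a", "b"), 100, replace = TRUE))
df$z <- with(df, ifelse(v == "a", 2 * y + x, 3 * x - y))
# "c" and "e" are declared as levels but never occur in the data
df$v <- factor(df$v, levels = c("c", "e", "a", "b"))
lmt <- lmtree(z ~ x + y | v, df)
# the predicted node for the first observation varies between calls
for (i in 1:10) print(predict(lmt, df, type = "node")[1])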
A similar problem occurs if the order of the factor levels differs between the lmtree call and the predict call (for example, changing from levels=c("a","b") to levels=c("b","a")).
Thanks for raising this issue; it's a bug in partykit (up to the current version 1.2-2).
The source of the problem is the following: lmtree() takes the formula and data and builds a model.frame from these, setting drop.unused.levels = TRUE. Thus, for v only the levels "a" and "b" are retained, while "c" and "e" are dropped. However, the same is not done in predict.party, where model.frame is called without specifying drop.unused.levels, thus using the default FALSE. The result is a mismatch between the factor levels, which leads to random assignments.
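The level mismatch can be illustrated with base R alone. This is only a sketch of the effect of drop.unused.levels on a model frame, not the actual partykit internals; the names mf_fit and mf_pred are made up for the illustration.

```r
# Base R only: shows how drop.unused.levels changes the factor coding.
# mf_fit / mf_pred are hypothetical names, not objects used by partykit.
df <- data.frame(v = factor(c("a", "b", "a"), levels = c("c", "e", "a", "b")))

# What happens at fitting time: unused levels are dropped
mf_fit <- model.frame(~ v, data = df, drop.unused.levels = TRUE)

# What happens at prediction time: the default FALSE keeps all levels
mf_pred <- model.frame(~ v, data = df)

levels(mf_fit$v)   # only the levels that occur in the data
levels(mf_pred$v)  # all four declared levels

# Same labels, but different underlying integer codes
as.integer(mf_fit$v)
as.integer(mf_pred$v)
```

The labels are identical in both frames, but the integer codes behind them differ, so any code that matches levels by position rather than by label ends up comparing different things.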
I will coordinate a fixed version with Torsten! The same problem seems to lurk in other places so we need to do some more checking first.
In the meantime, the best way to avoid this is to drop the unused levels before calling lmtree (or other functions in partykit that seem to have the same problem).
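A minimal sketch of this workaround, applied to the example from the question, could look as follows. It uses droplevels() from base R; the set.seed() call and the nodes variable are additions for the illustration only.

```r
library(partykit)

set.seed(1)  # added here only to make the illustration reproducible
df <- data.frame(x = runif(100), y = runif(100),
                 v = sample(c("a", "b"), 100, replace = TRUE))
df$z <- with(df, ifelse(v == "a", 2 * y + x, 3 * x - y))
df$v <- factor(df$v, levels = c("c", "e", "a", "b"))

# Workaround: drop the unused levels "c" and "e" before fitting
df$v <- droplevels(df$v)
lmt <- lmtree(z ~ x + y | v, data = df)

# Repeated predictions for the first observation should now agree
nodes <- replicate(10, predict(lmt, newdata = df, type = "node")[1])
length(unique(nodes))
```

Since the fitting data and the prediction data now carry the same levels in the same order, the mismatch described above should not arise. For separate new data it is probably safest to give v exactly the same levels, for example with factor(newdata$v, levels = levels(df$v)), where newdata is a hypothetical data frame.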