lmtree suspect behavior with factors


I noticed strange behavior in the lmtree function from the partykit package when I use it with factors. If some levels do not occur in the dataset (here "c" and "e"), the predictions change randomly from one call to the next.

I guess this means that lmtree builds the model only with the factor levels that occur in the dataset ("a" and "b" in this example), while the predict function takes all levels into account ("a", "b", "c", "e").

So how can I safely use factors with lmtree models?

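library(partykit)

# v only ever takes the values "a" and "b" ...
df <- data.frame(x = runif(100), y = runif(100), v = sample(c("a", "b"), 100, replace = TRUE))
df$z <- with(df, ifelse(v == "a", 2 * y + x, 3 * x - y))
# ... but the factor is declared with two extra, unused levels "c" and "e"
df$v <- factor(df$v, levels = c("c", "e", "a", "b"))

lmt <- lmtree(z ~ x + y | v, data = df)

# The predicted node for the first observation changes from call to call
for (i in 1:10) print(predict(lmt, df, type = "node")[1])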

A similar problem occurs if the order of the factor levels differs between the lmtree call and the predict call (for example, changing from levels=c("a","b") to levels=c("b","a")).


Solution

  • Thanks for raising this issue. It is a bug in partykit (up to the current version 1.2-2).

    The source of the problem is the following: lmtree() takes the formula and data and builds a model.frame from them, setting drop.unused.levels = TRUE. Thus, for v only the levels "a" and "b" are retained, while "c" and "e" are dropped. However, predict.party does not do the same: it calls model.frame without specifying drop.unused.levels and therefore uses the default, FALSE. The result is a mismatch between the factor levels at fitting time and at prediction time, which leads to random node assignments.

    I will coordinate a fixed version with Torsten! The same problem seems to lurk in other places so we need to do some more checking first.

    In the meantime, the best way to avoid this is to drop the unused levels before calling lmtree (or other partykit functions that seem to have the same problem).
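
The workaround above can be sketched roughly as follows, reusing the data from the question. The first part uses only base R to show the level mismatch between a model frame built with drop.unused.levels = TRUE and one built with the default. The second part drops the unused levels with droplevels() before fitting. The names mf_fit, mf_pred and df_clean are made up for this illustration, and set.seed() is there only to make the sketch reproducible. The partykit part is guarded so that the snippet also runs where the package is not installed.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

df <- data.frame(x = runif(100), y = runif(100),
                 v = sample(c("a", "b"), 100, replace = TRUE))
df$z <- with(df, ifelse(v == "a", 2 * y + x, 3 * x - y))
df$v <- factor(df$v, levels = c("c", "e", "a", "b"))

# The mismatch, illustrated with base R only (not the partykit internals):
mf_fit  <- model.frame(z ~ x + y + v, data = df, drop.unused.levels = TRUE)
mf_pred <- model.frame(z ~ x + y + v, data = df)  # drop.unused.levels = FALSE
levels(mf_fit$v)   # "a" "b"
levels(mf_pred$v)  # "c" "e" "a" "b"

# Workaround: drop the unused levels before fitting, then fit and predict
# on data that carries exactly the same levels.
df_clean <- droplevels(df)
levels(df_clean$v)  # "a" "b"

if (requireNamespace("partykit", quietly = TRUE)) {
  lmt <- partykit::lmtree(z ~ x + y | v, data = df_clean)
  # With matching levels, repeated calls should return the same node.
  for (i in 1:3) print(predict(lmt, newdata = df_clean, type = "node")[1])
}
```

The same idea applies to new data passed to predict: its factor should be built with the levels used for fitting, in the same order, for example factor(newdata$v, levels = levels(df_clean$v)), where newdata stands for a hypothetical data frame of new observations.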