
partykit minsize option drops branches that exceed minsize


I'm using the lmtree() function from partykit to partition data using linear regressions. The regressions use a weight, and I want to ensure that each branch has a minimum total weight, which I specify with the minsize option. For instance, in the following example the tree only has two branches instead of three because x1=="C" has too small a weight to be in its own branch.

library("partykit")
n <- 100
X <- rbind(
  data.frame(TT=1:n, x1="A", weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1="B", weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1="C", weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
)
X$x1 <- factor(X$x1)
tr <- lmtree(y ~ TT | x1, data=X, weights=weight, minsize=150)

Fitted party:
[1] root
|   [2] x1 in A: n = 200
|       (Intercept)          TT 
|         0.7724903   0.2002023 
|   [3] x1 in B, C: n = 300
|       (Intercept)          TT 
|         0.5759213   0.4659592 

I also have some real-world data that unfortunately is confidential but is leading to some behavior that I do not understand. When I do not specify minsize it builds a tree with 30 branches, where in every branch the total weight n is a large number. However, when I specify a minsize that is well below the total weight of every branch from this first tree the result is a new tree with many fewer branches. I would not have expected the tree to change at all because it seems that minsize is not binding. Is there any explanation for this result?

UPDATE

Providing an example

n <- 100
X <- rbind(
  data.frame(TT=1:n, x1=runif(n, 0.0, 0.3), weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.3, 0.7), weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.7, 1.0), weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
)
tr <- lmtree(y ~ TT | x1, data=X, weights = weight)

Fitted party:
[1] root
|   [2] x1 <= 0.29787: n = 200
|       (Intercept)          TT 
|         0.8431985   0.1994021 
|   [3] x1 > 0.29787
|   |   [4] x1 <= 0.69515: n = 200
|   |       (Intercept)          TT 
|   |         0.6346980   0.3995678 
|   |   [5] x1 > 0.69515: n = 100
|   |       (Intercept)          TT 
|   |         0.4792462   0.5987472 

Now let's set minsize=150. The tree no longer has any splits even though x1 <= 0.3 and x1 > 0.3 would work.

tr <- lmtree(y ~ TT | x1, data=X, weights = weight, minsize=150)

Fitted party:
[1] root: n = 500
    (Intercept)          TT 
      0.6870078   0.3593374
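
As a quick sanity check of the claim that a split near 0.3 would respect the size constraint, the total weight on each side can be summed directly. This is a minimal sketch in base R: `side_weights()` is a hypothetical helper written for illustration, the seed is an addition, and the check covers node size only, not how partykit selects or tests splits.

```r
# Rebuild the partitioning variable and weights from the example above
# (the seed is an addition; y is omitted because only sizes matter here).
set.seed(1)
n <- 100
X <- rbind(
  data.frame(TT = 1:n, x1 = runif(n, 0.0, 0.3), weight = 2),
  data.frame(TT = 1:n, x1 = runif(n, 0.3, 0.7), weight = 2),
  data.frame(TT = 1:n, x1 = runif(n, 0.7, 1.0), weight = 1)
)

# Hypothetical helper: total weight on each side of a split at `cut`.
side_weights <- function(x, w, cut) {
  c(left = sum(w[x <= cut]), right = sum(w[x > cut]))
}

sw <- side_weights(X$x1, X$weight, cut = 0.3)
print(sw)

# Both sides reach minsize = 150, so size alone does not rule out this split.
print(all(sw >= 150))
```

The left side carries a total weight of 200 and the right side 300, so a minimum node size of 150 should not by itself prevent this split.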

Solution

  • Two rules applied in mob() (the infrastructure underlying lmtree()) are important in this context and may benefit from more explicit discussion:

    • If mob() selects a splitting variable at any stage and that variable does not lead to any admissible split (in terms of minimal node size), then splitting stops at that point. This is in contrast to ctree() which always performs a split if a significant test was detected - even if the second-best variable was non-significant. It would probably be good to offer more granular control over this - and we have it on our wishlist for the upcoming revision of the package.

    • By default the weights are interpreted as case weights, i.e., mob() thinks that there were w independent observations identical to the given one. Thus, the number of observations is the sum of weights. But note that this also affects the significance tests for which the sample size increases!
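
To make the case-weight reading in the second rule concrete, here is a minimal sketch using plain lm() on made-up data (all names and values are illustrative, and it does not reproduce what mob() does internally). A row with weight w is expanded into w identical rows. The coefficients agree between the weighted and the expanded fit, but the number of cases that enters any test differs.

```r
# Illustrative data only: 10 rows with integer weights 1 or 2.
set.seed(1)
d <- data.frame(x = 1:10, w = rep(c(1, 2), 5))
d$y <- 1 + 0.5 * d$x + rnorm(10, sd = 0.2)

# Case-weight reading: a row with weight w stands for w identical rows.
d_expanded <- d[rep(seq_len(nrow(d)), d$w), ]

n_weighted <- sum(d$w)          # sample size under the case-weight reading
n_expanded <- nrow(d_expanded)  # rows after explicit expansion

fit_w <- lm(y ~ x, data = d, weights = w)  # weighted fit on the 10 rows
fit_e <- lm(y ~ x, data = d_expanded)      # unweighted fit on the expanded rows

# Same coefficients (up to numerical tolerance) ...
print(all.equal(coef(fit_w), coef(fit_e)))

# ... but different residual degrees of freedom, because lm() itself
# counts rows, not the sum of the weights.
print(c(weighted = df.residual(fit_w), expanded = df.residual(fit_e)))
```

Here the expanded data has 15 rows, which equals the sum of the weights, while the weighted lm() fit still counts only the 10 original rows. Under the case-weight interpretation described above, it is the larger count that drives the significance tests.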

    As for your main question: It's hard to come up with an explanation without any reproducible example. I agree that partykit should behave in the way you describe it - but maybe there is one important but not so obvious detail that you haven't noticed yet... It would be good if you could come up with a small/simple artificial data set that replicates the problem.

    Update

    As already pointed out in the comments: Thanks for the reproducible example in your updated question. This helped me track down a bug in mob() in handling case weights. There was an error in the computation of the test statistic in the presence of case weights, which led to incorrect split variable selection and an incorrect stopping criterion. I have just fixed this bug and the new partykit development version is available from R-Forge at https://r-forge.r-project.org/R/?group_id=261. (Note, however, that R-Forge at the moment only builds Windows binaries for R 3.3.x. If a more recent version of R is used on Windows, please use type = "source" to install the source package - and make sure you have the necessary Rtools installed.)

    In your example I just set a random seed for exact reproducibility. The weighted data is set up as:

    set.seed(1)
    n <- 100
    X <- rbind(
      data.frame(TT=1:n, x1=runif(n, 0.0, 0.3), weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
      data.frame(TT=1:n, x1=runif(n, 0.3, 0.7), weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
      data.frame(TT=1:n, x1=runif(n, 0.7, 1.0), weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
    )
    

    Then the weighted tree can be fitted as before. In this particular example the tree structure remains unaffected but the test statistics and p-values of the parameter instability test in each node change somewhat:

    library("partykit")
    tr1 <- lmtree(y ~ TT | x1, data = X, weights = weight)
    plot(tr1)
    

    [Figure: tree1]

    Adding the minsize = 150 argument now has the expected effect of just avoiding the split in node 3.

    tr2 <- lmtree(y ~ TT | x1, data = X, weights = weight, minsize = 150)
    plot(tr2)
    

    [Figure: tree2]

    To check that the latter actually does the right thing we compare it with the tree for the explicitly expanded data. As the weights are regarded as case weights here, we can inflate the data set by repeating those observations with weights greater than 1.

    Xw <- X[rep(1:nrow(X), X$weight), ]
    tr3 <- lmtree(y ~ TT | x1, data = Xw, minsize = 150)
    

    The resulting coefficients are the same (up to very small numerical differences):

    all.equal(coef(tr2), coef(tr3))
    ## [1] TRUE
    

    And, more importantly, all test statistics and p-values in the nodes are also the same:

    library("strucchange")
    all.equal(sctest(tr2), sctest(tr3))
    ## [1] TRUE