Tags: r, tree, regression, partitioning, party

subgroup identification with regression tree


I am very interested in a recent paper, "Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials" (DOI: 10.1002/sim.7064), and I want to reproduce the results in the section "Performance of the tree-based regression approach". However, the partitioning tree does not give the result I expect.

set.seed(123)
n <- 1000
x1 <- rnorm(n)           # continuous covariate with a main effect
x2 <- runif(n)           # continuous covariate with a main effect
t <- rbinom(n, 1, 0.5)   # randomized treatment indicator
x3 <- rbinom(n, 1, 0.3)  # noise covariate (binary)
x4 <- rnorm(n)           # noise covariate (continuous)
## linear predictor: main effects of x1 and x2 plus a subgroup-specific
## treatment effect defined by the cutoffs -0.3 (for x1) and 0.4 (for x2)
z <- 1 + 2*x1 - 1.8*x2 + (x1 >= -0.3)*(x2 >= 0.4)*t - (x1 < -0.3)*(x2 < 0.4)*t
pr <- 1/(1 + exp(-z))    # inverse logit
y <- rbinom(n, 1, pr)    # binary response
dt <- data.frame(x1, x2, x3, x4, t, y)

library(party)
## model-based recursive partitioning: treatment-only GLM in each node,
## partitioning over x1-x4
mb <- mob(y ~ t - 1 | x1 + x2 + x3 + x4,
  data = dt,
  model = glinearModel,
  family = binomial(),
  control = mob_control(minsplit = 100))
plot(mb)

[Figure: regression tree fitted by mob()]

The figure is shown above. The tree is supposed to split on x1 and x2 at the cutoff values of -0.3 and 0.4 defined in the simulation, but it does not appear to do so: x1 dominates the partitioning and x2 hardly figures as a splitting variable. What's wrong with the code?


Solution

  • The model-based GLM tree you have specified tries to fit the following model: logit(prob) = alpha_tree(x1-x4) * t, where alpha_tree(x1-x4) is a subgroup-specific coefficient for the treatment indicator t, with the subgroups determined by the tree (based on the partitioning variables x1-x4). Thus, this model does not allow for an intercept or for global main effects of x1 and x2, as simulated in your data. Because these main effects are quite pronounced, the model has no choice but to capture them through further splits. Hence the large tree in your example.
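
    For illustration (this check is not part of the original answer), a plain additive logistic regression on the simulated data shows how pronounced the global main effects of x1 and x2 are; a treatment-only tree can absorb them only through additional splits:

    ## hedged sketch: ordinary GLM with global main effects, just to gauge
    ## how strong the x1 and x2 effects are relative to the treatment effect
    summary(glm(y ~ x1 + x2 + t, family = binomial, data = dt))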

    In the classic MOB framework (Zeileis et al. 2008, Journal of Computational and Graphical Statistics, doi:10.1198/106186008X319331), the only option would be to include every relevant regressor in the model to be partitioned, i.e., logit(mu) = beta0_tree(x1-x4) + beta1_tree(x1-x4) * x1 + beta2_tree(x1-x4) * x2 + alpha_tree(x1-x4) * t. This also works and detects subgroups with respect to the treatment effect alpha * t alone, but it loses some efficiency because the beta coefficients are re-estimated in every subgroup as well. A discussion of this approach specifically for subgroup analyses is available in Seibold et al. (2016a), The International Journal of Biostatistics, doi:10.1515/ijb-2015-0032.

    Recently, we have suggested an adaptation of MOB that we call PALM trees, for partially additive linear model trees (Seibold et al. 2016b, http://arxiv.org/abs/1612.07498). These make it possible to fit models of the type logit(mu) = beta0 + beta1 * x1 + beta2 * x2 + alpha_tree(x1-x4) * t, as you simulated in your question.

    Implementations of both the classic GLM-based MOB tree and the PALM tree are available as glmtree() and palmtree(), respectively, in partykit, which is recommended over the old party implementation. Applying these to your simulated data yields the following. First, it is important that all categorical partitioning variables are flagged as factors (so that the right parameter instability tests are chosen):

    dt <- transform(dt, x3 = factor(x3))
    

    Then we can fit the model you simulated from: only a subgroup-specific treatment effect, global main effects of x1 and x2, and partitioning based on x1, x2, x3, and x4.

    library("partykit")
    tr1 <- palmtree(y ~ t - 1 | x1 + x2 | x1 + x2 + x3 + x4, data = dt,
      family = binomial, minsize = 100)
    tr1
    ## Partially additive generalized linear model tree (family: binomial)
    ## 
    ## Model formula:
    ## y ~ t - 1 | x1 + x2 + x3 + x4
    ## 
    ## Fitted party:
    ## [1] root
    ## |   [2] x1 <= -0.21797: n = 399
    ## |                t 
    ## |       -0.9245345 
    ## |   [3] x1 > -0.21797: n = 601
    ## |               t 
    ## |       0.6033979 
    ## 
    ## Number of inner nodes:    1
    ## Number of terminal nodes: 2
    ## Number of parameters per node: 1
    ## Objective function (negative log-likelihood): 432.5717
    ## 
    ## Linear fixed effects (from palm model):
    ## (Intercept)          x1          x2 
    ##   0.7140397   1.7589675  -1.1335779 
    

    Thus, this covers the most important parts of the data-generating process but fails to detect the interaction with x2.

    plot(tr1)
    

    [Figure: plot of the fitted PALM tree tr1]

    I played around with other seeds and with BIC-based post-pruning (combined with a large significance level), which sometimes did discover the interaction with x2 as well. Presumably, a larger sample would also yield the "true" tree more often. Thus, the model is in principle capable of fitting the model you simulated, just not in this particular setting.
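
    One way to try this (the settings below are illustrative, not from the original answer, and assume that palmtree() forwards alpha and prune to mob_control() in the same way it forwards minsize above):

    ## liberal significance level for splitting plus BIC-based post-pruning;
    ## this sometimes recovers the additional split on x2 as well
    tr1b <- palmtree(y ~ t - 1 | x1 + x2 | x1 + x2 + x3 + x4, data = dt,
      family = binomial, minsize = 100, alpha = 0.5, prune = "BIC")
    tr1b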

    Personally, I would always let both the intercept and the treatment effect vary across the subgroups, because if there are any main effects I have overlooked, this is likely to yield a better model. And if an intercept is included in every subgroup anyway, it is nicer to code both y and t as factors, which also yields nicer plots in the visualization:

    dt <- transform(dt, y = factor(y), t = factor(t))
    tr2 <- palmtree(y ~ t | x1 + x2 | x1 + x2 + x3 + x4, data = dt,
      family = binomial, minsize = 100)
    tr2
    ## Partially additive generalized linear model tree (family: binomial)
    ## 
    ## Model formula:
    ## y ~ t | x1 + x2 + x3 + x4
    ## 
    ## Fitted party:
    ## [1] root
    ## |   [2] x1 <= -0.26515: n = 382
    ## |       (Intercept)          t1 
    ## |         0.9839353  -1.1376981 
    ## |   [3] x1 > -0.26515: n = 618
    ## |       (Intercept)          t1 
    ## |         0.5331386   0.6566111 
    ## 
    ## Number of inner nodes:    1
    ## Number of terminal nodes: 2
    ## Number of parameters per node: 2
    ## Objective function (negative log-likelihood): 432.3303
    ## 
    ## Linear fixed effects (from palm model):
    ##        x1        x2 
    ##  1.964397 -1.078958 
    

    For these data, this fits almost the same model as above, but the display is much easier to read, showing the large difference in the absolute treatment effect between the two groups:

    plot(tr2)
    

    [Figure: plot of the fitted PALM tree tr2]

    Finally, when using a plain old MOB without the option of global additive main effects, we should include the regressors x1 and x2 in every subgroup:

    tr3 <- glmtree(y ~ t + x1 + x2 | x1 + x2 + x3 + x4, data = dt,
      family = binomial, minsize = 100)
    tr3
    ## Generalized linear model tree (family: binomial)
    ## 
    ## Model formula:
    ## y ~ t + x1 + x2 | x1 + x2 + x3 + x4
    ## 
    ## Fitted party:
    ## [1] root
    ## |   [2] x1 <= -0.26515: n = 382
    ## |       (Intercept)          t1          x1          x2 
    ## |         0.5570219  -1.0511317   1.2533975  -1.3899679 
    ## |   [3] x1 > -0.26515: n = 618
    ## |       (Intercept)          t1          x1          x2 
    ## |         0.3573041   0.6943206   2.2910053  -0.9570403 
    ## 
    ## Number of inner nodes:    1
    ## Number of terminal nodes: 2
    ## Number of parameters per node: 4
    ## Objective function (negative log-likelihood): 429.2406
    

    Again, this finds essentially the same subgroups as before. However, it loses a bit of efficiency because more regression coefficients have to be estimated in each subgroup, while only the t coefficient really changes between the subgroups.
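
    This can be checked by extracting the node-wise coefficient estimates with the standard coef() accessor for partykit model trees (one row per terminal node):

    ## expected: two rows (one per terminal node) with columns
    ## (Intercept), t1, x1, x2 as printed above
    coef(tr3)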

    plot(tr3, tp_args = list(which = 1))
    

    [Figure: plot of the fitted GLM tree tr3]