I am trying to use the MOB procedure from the R package partykit
to predict survival probabilities based on a set of covariates X1,...,X25 and a treatment effect W. The linear predictor in each node in MOB only uses W, X1, and X2, but each covariate is used for selection for node splitting. I would like to force the MOB to only split according to parameter instability for the treatment effect W. When doing prediction in the final line of code below, I get the following error:
Error in rval[ix[[i]], ] <- preds[[i]] :
number of items to replace is not a multiple of replacement length
In addition: Warning messages:
1: 'newdata' had 1070 rows but variables found have 1029 rows
2: 'newdata' had 1337 rows but variables found have 1291 rows
3: 'newdata' had 1690 rows but variables found have 1680 rows
4: 'newdata' had 903 rows but variables found have 1000 rows
I believe this error occurs because the number of test observations falling in each terminal node is different than that of the training observations. How can I modify the predict statement to handle this issue and obtain predictions on the test set? I would also like to know if I'm using the parm
option correctly in specifying that parameter instability should be assessed according to W.
library("survival")
library("partykit")
n=5000;n.test=5000;p=25;pi=0.5;beta=1
gamma=0.5;rho=2;cen.scale=4;n.mc=10000;
Y.max=2
generate_data <- function(n, p, pi = 0.5, beta = 1, gamma = 1, rho = 2, cen.scale = 4,
Y.max = NULL){
W <- rbinom(n, 1, pi)
X <- matrix(rnorm(n * p), n, p)
numerator <- -log(runif(n))
cox.ft <- (numerator / exp(beta * X[ ,1] + (-0.5 - gamma * X[ ,2]) * W))^2
failure.time <- pmin(cox.ft, Y.max)
numeratorC <- -log(runif(n))
censor.time <- (numeratorC / (cen.scale ^ rho)) ^ (1 / rho)
Y <- pmin(failure.time, censor.time)
D <- as.integer(failure.time <= censor.time)
list(X = X, Y = Y, W = W, D = D)
}
data <- generate_data(n, p=p, pi = pi, beta = beta, gamma = gamma, rho = rho, cen.scale = cen.scale,
Y.max = Y.max)
data.test <- generate_data(n.test, p=p, pi = pi, beta = beta, gamma = gamma, rho = rho, cen.scale = cen.scale,
Y.max = Y.max)
X=data$X
Y=data$Y
W=data$W
D=data$D
var_prog <- c("X1","X2")
colnames(X) <- paste("X", 1:25, sep="")
cov.names <- colnames(X)
wbreg <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
survreg(y ~ 0 + x, weights = weights, dist = "weibull", ...)
}
dat <- data.frame(Y=Y,D=D,W=W,X)
eqn <- paste0("Surv(Y, D) ~ W + ",paste0(var_prog, collapse = "+")," | ",
paste0(cov.names, collapse = "+"))
glmtr <- partykit::mob(as.formula(eqn), data = dat,
fit = wbreg, control = mob_control(parm=2,minsize = 0.2*nrow(dat),
alpha = 0.10, bonferroni = TRUE))
plot(glmtr)
dat.test <- data.frame(Y=data.test$Y,D=data.test$D, W=data.test$W,data.test$X)
pct <- 1:98/100
quantile_pred <- predict(glmtr, newdata = dat.test, type = "quantile",p=pct)
The problem is that dat.test
only provides the original variables that mob()
has seen (i.e., Y
, D
, W
, etc.) while survreg()
has seen the processed variables y
and x
.
The predict()
method for mob()
trees internally first predicts the node ID (which works smoothly in your example) and then passes on the correct subsets of newdata
to the predict()
method for the fitted model objects (from survreg()
in this case). As the latter does not find the variables y
and x
in newdata
it takes them from the learning data. Hence you get the warnings/errors about the mismatching dimensions.
So there are two ways to deal with this:
survreg
output believe it was fitted with the formula Surv(Y, D) ~ W + X1 + X2
ornewdata
to provide x
.Strategy 1 is what lmtree()
and glmtree()
do internally. You have to be careful, though, that everything still works correctly when changing the supposed formula and terms. Hence, it is easier to apply strategy 2 safely, which is what I would recommend here.
dat.test$x <- model.matrix(~ W + X1 + X2, data = dat.test)
predict(glmtr, newdata = head(dat.test, 4), type = "quantile", p = 1:9/10)
## [,1] [,2] [,3] [,4] [,5] [,6]
## 1 0.0044903736 0.019754829 0.049863880 0.10133392 0.18511217 0.32115772
## 2 0.0008963665 0.003943451 0.009953807 0.02022824 0.03695202 0.06410937
## 3 0.0076736262 0.034940287 0.090110107 0.18616205 0.34486226 0.60601526
## 4 0.0014907219 0.006787697 0.017505298 0.03616489 0.06699489 0.11772795
## [,7] [,8] [,9]
## 1 0.5505803 0.9765134 1.9803484
## 2 0.1099066 0.1949312 0.3953163
## 3 1.0520106 1.8908381 3.8980339
## 4 0.2043695 0.3673249 0.7572541
Caveat: The predict()
method for survreg()
objects with a multivariate p
only returns a matrix if newdata
has more than one row. If newdata
has just a single row it returns a vector. This confuses the predict()
method for mob()
if it happens in the first node where predict()
is applied because this determines the dimension of the output object. If it happens in subsequent nodes it is no problem. Also, univariate p
is never a problem.
Bonus: Yes, you are using parm
as intended. However, note that this only affects the parameter instability tests. Thus, the splitting variables in the tree are selected based on how much the W
effect changes along those variables. But for selecting the split point in the variable the full log-likelihood of the model (including all regressors) is maximized. Thus, the split point may be sensitive to changes in all coefficients, not just the one of W
.