I want to bin a score, where the score indicates the likelihood that default = 1. The bins should be found automatically using glmtree() from the partykit package in R, such that each bin contains scores with a similar default rate. This worked before, but now the score has become much better at predicting defaults and it seems partykit no longer finds a solution.
Here is an example using synthetic data that is close to mine.
library(tidyverse)

sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

## synthetic data: the default probability is a sigmoid of the score
n_sample <- 10^5
score <- runif(n_sample, min = -5, max = 5)
defaults <- rbinom(length(score), size = 1, prob = sigmoid(score))
df <- tibble(score = score, default_flag = defaults)

partykit::glmtree(formula = default_flag ~ score,
                  data = as.data.frame(df),
                  family = binomial)
The average default rate looks as follows when binning the score into 100 equal-width bins:
df %>%
  mutate(score_bin = cut(score, breaks = 100)) %>%
  group_by(score_bin) %>%
  summarise(default_rate = sum(default_flag) / n()) %>%
  plot()
My intuition is that partykit does not find a solution because many cuts would work very well, i.e. each of them would split the data into a group with more defaults and a group with fewer defaults. Does this make sense?
How can I make partykit::glmtree find a binning for this example?
I have tried
TL;DR

In your example, glmtree() does find a tree: it corresponds to the 22 bins shown in the output below, which are very similar to the 24 bins that ctree() finds.
Background: The main difference between the two algorithms is that glmtree() is a lot (!) slower, because it does not exploit the fact that you are only estimating a mean/intercept within each bin. Thus, it always sets up a GLM, runs the IWLS algorithm (iteratively weighted least squares), etc. In contrast, ctree() is designed for this situation of constant fits and also codes many aspects of the algorithm in C, which pays off for a larger sample size as in your example.
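To make the speed difference concrete, here is a rough timing sketch of my own (not part of the original comparison; exact timings depend on your machine and the simulated data):

library(partykit)

## copy of the data with a factor response, so both calls fit classification trees
d <- as.data.frame(df)
d$default_flag <- factor(d$default_flag)

## slow: repeated GLM/IWLS fits during the exhaustive split search
system.time(glmtree(default_flag ~ score, data = d, family = binomial))

## fast: constant fits, with the core of the algorithm implemented in C
system.time(ctree(default_flag ~ score, data = d))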
Practical considerations:

- Use ctree() for sample sizes like this. The results will typically be very similar.
- Use glmtree() with more than just an intercept. If you fit a logistic regression of default on score, then no splits at all are needed, because you simulated the data from a logistic model.
- If you want to keep constant fits in glmtree(), you might consider applying some mild rounding of the score, which alleviates the computational burden of finding the best split point within the tree (a sketch follows at the end of this answer).

Illustration: Let's compare these three trees:
library(partykit)

## data as data.frame with factor response
df <- as.data.frame(df)
df$default_flag <- factor(df$default_flag)

## GLM tree with intercept only (many splits, slow)
tr <- glmtree(default_flag ~ score, data = df, family = binomial)

## GLM tree with score as linear logistic regressor (no splits, fast)
tr2 <- glmtree(default_flag ~ score | score, data = df, family = binomial)

## CTree with intercept only (many splits, fast)
tr3 <- ctree(default_flag ~ score, data = df)
The resulting fitted curves look like this (replication code for the figure is at the end of this answer):
Note that the true sigmoid curve (yellow line) is recovered almost exactly by the GLM tree with intercept and slope (red line). The latter tree has no splits at all and just estimates one intercept (close to 0) and one slope (close to 1).
The two constant-fit trees yield much cruder step approximations of the true smooth function. The GLM tree (blue line) and CTree (green line) almost coincide for most bins and differ only slightly in a few intervals. This is not unusual, because what the two algorithms actually compute is rather closely related, even though the motivations are very different.
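To check this similarity directly (a small addition of mine, using only standard partykit predictions), you can assign each observation to its terminal node in each tree and compare the resulting binnings:

## assign each observation to its terminal node (= bin) in each tree
node_glmtree <- predict(tr, newdata = df, type = "node")
node_ctree <- predict(tr3, newdata = df, type = "node")

## number of bins found by each algorithm
length(unique(node_glmtree))
length(unique(node_ctree))

## cross-tabulation shows how closely the two binnings overlap
table(node_glmtree, node_ctree)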
For convenience, I include the replication code for the figure below. First, here is the printed output of the fitted GLM trees with intercept only (blue line) and with intercept + slope (red line):
tr
## Generalized linear model tree (family: binomial)
##
## Model formula:
## default_flag ~ 1 | score
##
## Fitted party:
## [1] root
## | [2] score <= 0.06228
## | | [3] score <= -1.9528
## | | | [4] score <= -2.99924
## | | | | [5] score <= -4.01647
## | | | | | [6] score <= -4.33661: n = 6651
## | | | | | (Intercept)
## | | | | | -4.903308
## | | | | | [7] score > -4.33661: n = 3199
## | | | | | (Intercept)
## | | | | | -4.163337
## | | | | [8] score > -4.01647
## | | | | | [9] score <= -3.46188: n = 5579
## | | | | | (Intercept)
## | | | | | -3.767639
## | | | | | [10] score > -3.46188: n = 4564
## | | | | | (Intercept)
## | | | | | -3.240046
## | | | [11] score > -2.99924
## | | | | [12] score <= -2.54637: n = 4514
## | | | | (Intercept)
## | | | | -2.731446
## | | | | [13] score > -2.54637: n = 6058
## | | | | (Intercept)
## | | | | -2.258887
## | | [14] score > -1.9528
## | | | [15] score <= -1.07496
## | | | | [16] score <= -1.40411: n = 5451
## | | | | (Intercept)
## | | | | -1.676302
## | | | | [17] score > -1.40411: n = 3262
## | | | | (Intercept)
## | | | | -1.227906
## | | | [18] score > -1.07496
## | | | | [19] score <= -0.45808
## | | | | | [20] score <= -0.73189: n = 3517
## | | | | | (Intercept)
## | | | | | -0.8555708
## | | | | | [21] score > -0.73189: n = 2746
## | | | | | (Intercept)
## | | | | | -0.6578462
## | | | | [22] score > -0.45808
## | | | | | [23] score <= -0.28434: n = 1771
## | | | | | (Intercept)
## | | | | | -0.3642221
## | | | | | [24] score > -0.28434: n = 3366
## | | | | | (Intercept)
## | | | | | -0.1106296
## | [25] score > 0.06228
## | | [26] score <= 1.84071
## | | | [27] score <= 0.78299
## | | | | [28] score <= 0.39579: n = 3326
## | | | | (Intercept)
## | | | | 0.1675587
## | | | | [29] score > 0.39579: n = 3909
## | | | | (Intercept)
## | | | | 0.5967918
## | | | [30] score > 0.78299
## | | | | [31] score <= 1.09468: n = 3200
## | | | | (Intercept)
## | | | | 0.9136672
## | | | | [32] score > 1.09468
## | | | | | [33] score <= 1.38093: n = 2779
## | | | | | (Intercept)
## | | | | | 1.287567
## | | | | | [34] score > 1.38093: n = 4617
## | | | | | (Intercept)
## | | | | | 1.577804
## | | [35] score > 1.84071
## | | | [36] score <= 3.06821
## | | | | [37] score <= 2.36304: n = 5279
## | | | | (Intercept)
## | | | | 2.042677
## | | | | [38] score > 2.36304: n = 6970
## | | | | (Intercept)
## | | | | 2.697387
## | | | [39] score > 3.06821
## | | | | [40] score <= 3.97705
## | | | | | [41] score <= 3.42134: n = 3498
## | | | | | (Intercept)
## | | | | | 3.215314
## | | | | | [42] score > 3.42134: n = 5520
## | | | | | (Intercept)
## | | | | | 3.598709
## | | | | [43] score > 3.97705: n = 10224
## | | | | (Intercept)
## | | | | 4.530427
##
## Number of inner nodes: 21
## Number of terminal nodes: 22
## Number of parameters per node: 1
## Objective function (negative log-likelihood): 32057.79
tr2
## Generalized linear model tree (family: binomial)
##
## Model formula:
## default_flag ~ score | score
##
## Fitted party:
## [1] root: n = 100000
## (Intercept) score
## -0.01087159 0.99669928
##
## Number of inner nodes: 0
## Number of terminal nodes: 1
## Number of parameters per node: 2
## Objective function (negative log-likelihood): 32090.08
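Since your ultimate goal is a binning of the score, note that you can extract the split points from the constant-fit tree and reuse them as breaks for cut(). The helper below is my own sketch built from exported partykit accessors (node_party(), kids_node(), split_node(), breaks_split(), is.terminal()), not a built-in function:

## collect all split points of a tree grown on a single numeric variable
get_breaks <- function(tree) {
  recurse <- function(node) {
    if (partykit::is.terminal(node)) return(numeric(0))
    c(partykit::breaks_split(partykit::split_node(node)),
      unlist(lapply(partykit::kids_node(node), recurse)))
  }
  sort(unique(recurse(partykit::node_party(tree))))
}

## right-closed intervals match the "score <= cutpoint" splits of the tree
breaks <- get_breaks(tr)
df$score_bin <- cut(df$score, breaks = c(-Inf, breaks, Inf))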
Figure replication code:
## fitted probabilities on a fine score grid
ndf <- data.frame(score = seq(from = -4.5, to = 4.5, by = 0.01))
ndf$prob <- predict(tr, newdata = ndf, type = "response")
ndf$prob2 <- predict(tr2, newdata = ndf, type = "response")
ndf$prob3 <- predict(tr3, newdata = ndf, type = "prob")[, 2]

## true sigmoid (yellow) vs. the three fitted curves
plot(sigmoid(score) ~ score, data = ndf, type = "l", col = 7, lwd = 4)
lines(prob ~ score, data = ndf, col = 4, lwd = 2)   ## glmtree, intercept only (blue)
lines(prob2 ~ score, data = ndf, col = 2, lwd = 2)  ## glmtree, intercept + slope (red)
lines(prob3 ~ score, data = ndf, col = 3, lwd = 2)  ## ctree (green)
legend("topleft",
       c("sigmoid", "glmtree(d ~ 1 | s)", "glmtree(d ~ s | s)", "ctree(d ~ s)"),
       col = c(7, 4, 2, 3), lwd = 2, lty = 1, bty = "n")
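Finally, a minimal sketch of the rounding idea from the practical considerations above (the granularity of two decimal places is just an assumption to tune):

## mild rounding reduces the number of distinct candidate split points,
## which speeds up glmtree()'s exhaustive split search considerably
df$score_rounded <- round(df$score, digits = 2)
tr4 <- glmtree(default_flag ~ score_rounded, data = df, family = binomial)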