For the example for the lift curve I run
library(caret)
set.seed(1)
simulated <- data.frame(obs = factor(rep(letters[1:2], each = 100)),
perfect = sort(runif(200), decreasing = TRUE),
random = runif(200))
lift2 <- lift(obs ~ random + perfect, data = simulated)
xyplot(lift2, plot = "lift", auto.key = list(columns = 2))
and get
as result. I expected the image to be swapped horizontally - something along the lines of
What am I doing wrong?
Btw: This is a lift chart not a cumulative gains chart.
The plot that I expected, produced now by my own code
mylift <- caret::lift(Class ~ cforest_prob + perfect_prob + guess_prob, data = data_test)
ggplot(mylift$data) +
geom_line(aes(CumTestedPct, lift, color = liftModelVar))
is
I noticed, that the data.frame mylift$data
contains the following columns:
names(mylift$data)
[1] "liftModelVar" "cuts" "events" "n" "Sn" "Sp" "EventPct"
[8] "CumEventPct" "lift" "CumTestedPct"
So I printed the following plot
ggplot(mylift$data) +
geom_line(aes(cuts, lift, color = liftModelVar))
So I guess that the different plots are just different ways of examining lift? I wasn't aware that there are different lift charts - I thought it was standardized across the industry.
Edit by the question author, for late readers: I accepted this answer for a large part because of the helpful discussion in the comments to this answer. Please consider reading the discussion!
Let's reproduce the graph and find the baseline. Let
cutoffs <- seq(0, 1, length = 1000)
be our cutoffs. Now the main computations are done by
aux <- sapply(cutoffs, function(ct) {
perf <- simulated$obs[simulated$perfect > ct]
rand <- simulated$obs[simulated$random > ct]
c(mean(perf == "a"), mean(rand == "a"))
})
where we go over the vector of cutoffs and do the following. Take the perfect
case. We say that whenever perfect > ct
, we are going to predict "a"
. Then simulated$obs[simulated$perfect > ct]
are the true values, while mean(perf == "a")
is our accuracy with a given cutoff. The same happens with random
.
As for the baseline, it is just a constant defined by the share of "a"
in the sample:
baseline <- mean(simulated$obs == "a")
When plotting the lifts, we divide our accuracy by that of the baseline method and get the same graph along with the baseline curve:
plot(x = cutoffs, y = aux[1, ] / baseline, type = 'l', ylim = c(0, 2), xlab = "Cutoff", ylab = "Lift")
lines(x = cutoffs, y = aux[2, ] / baseline, col = 'blue')
abline(a = baseline / baseline, b = 0, col = 'magenta')
Update:
Here's an illustration that, at least when plotted manually, the lift curve of the "expected" type can be manipulated and gives non-unique results.
Your example graph is from here, which also has this data:
# contacted response
# 1 10000 6000
# 2 20000 10000
# 3 30000 13000
# 4 40000 15800
# 5 50000 17000
# 6 60000 18000
# 7 70000 18800
# 8 80000 19400
# 9 90000 19800
# 10 100000 20000
Now suppose that we know not this evolution but 10 individual blocks:
# contacted response
# 1 10000 6000
# 2 10000 4000
# 3 10000 3000
# 4 10000 2800
# 5 10000 1200
# 6 10000 1000
# 7 10000 800
# 8 10000 600
# 9 10000 400
# 10 10000 200
In that case it depends on how we order the observations when putting "% Contacted" in the x-axis:
set.seed(1)
baseline <- sum(df$response) / sum(df$contacted) * cumsum(df$contacted)
lift1 <- cumsum(df$response)
lift2 <- cumsum(sample(df$response))
x <- 1:10 * 10
plot(x = x, y = lift1 / baseline, col = 'red', type = 'l', ylim = c(0, 3), xlab = "% Customers contacted", ylab = "Lift")
lines(x = x, y = lift2 / baseline, col = 'blue')
abline(a = baseline / baseline, b = 0, col = 'magenta')