Search code examples
rr-caret

Lift curve is swapped


For the example for the lift curve I run

library(caret)
set.seed(1)
simulated <- data.frame(obs = factor(rep(letters[1:2], each = 100)),
                        perfect = sort(runif(200), decreasing = TRUE),
                        random = runif(200))
lift2 <- lift(obs ~ random + perfect, data = simulated)
xyplot(lift2, plot = "lift", auto.key = list(columns = 2))

and get

enter image description here

as result. I expected the image to be swapped horizontally - something along the lines of

enter image description here

What am I doing wrong?

Btw: This is a lift chart not a cumulative gains chart.

Update:

The plot that I expected, produced now by my own code

mylift <- caret::lift(Class ~ cforest_prob + perfect_prob + guess_prob, data = data_test)
ggplot(mylift$data) +
  geom_line(aes(CumTestedPct, lift, color = liftModelVar))

is

enter image description here

I noticed, that the data.frame mylift$data contains the following columns:

names(mylift$data)
 [1] "liftModelVar" "cuts"         "events"       "n"            "Sn"           "Sp"           "EventPct"    
 [8] "CumEventPct"  "lift"         "CumTestedPct"

So I printed the following plot

ggplot(mylift$data) +
  geom_line(aes(cuts, lift, color = liftModelVar))

enter image description here

So I guess that the different plots are just different ways of examining lift? I wasn't aware that there are different lift charts - I thought it was standardized across the industry.


Solution

  • Edit by the question author, for late readers: I accepted this answer for a large part because of the helpful discussion in the comments to this answer. Please consider reading the discussion!


    Let's reproduce the graph and find the baseline. Let

    cutoffs <- seq(0, 1, length = 1000)
    

    be our cutoffs. Now the main computations are done by

    aux <- sapply(cutoffs, function(ct) {
      perf <- simulated$obs[simulated$perfect > ct]
      rand <- simulated$obs[simulated$random > ct]
      c(mean(perf == "a"), mean(rand == "a"))
    })
    

    where we go over the vector of cutoffs and do the following. Take the perfect case. We say that whenever perfect > ct, we are going to predict "a". Then simulated$obs[simulated$perfect > ct] are the true values, while mean(perf == "a") is our accuracy with a given cutoff. The same happens with random.

    As for the baseline, it is just a constant defined by the share of "a" in the sample:

    baseline <- mean(simulated$obs == "a")
    

    When plotting the lifts, we divide our accuracy by that of the baseline method and get the same graph along with the baseline curve:

    plot(x = cutoffs, y = aux[1, ] / baseline, type = 'l', ylim = c(0, 2), xlab = "Cutoff", ylab = "Lift")
    lines(x = cutoffs, y = aux[2, ] / baseline, col = 'blue')
    abline(a = baseline / baseline, b = 0, col = 'magenta')
    

    enter image description here

    Update:

    Here's an illustration that, at least when plotted manually, the lift curve of the "expected" type can be manipulated and gives non-unique results.

    Your example graph is from here, which also has this data:

    #   contacted response
    # 1      10000     6000
    # 2      20000    10000
    # 3      30000    13000
    # 4      40000    15800
    # 5      50000    17000
    # 6      60000    18000
    # 7      70000    18800
    # 8      80000    19400
    # 9      90000    19800
    # 10    100000    20000
    

    Now suppose that we know not this evolution but 10 individual blocks:

    #    contacted response
    # 1      10000     6000
    # 2      10000     4000
    # 3      10000     3000
    # 4      10000     2800
    # 5      10000     1200
    # 6      10000     1000
    # 7      10000      800
    # 8      10000      600
    # 9      10000      400
    # 10     10000      200
    

    In that case it depends on how we order the observations when putting "% Contacted" in the x-axis:

    set.seed(1)
    baseline <- sum(df$response) / sum(df$contacted) * cumsum(df$contacted)
    lift1 <- cumsum(df$response)
    lift2 <- cumsum(sample(df$response))
    x <- 1:10 * 10
    
    plot(x = x, y = lift1 / baseline, col = 'red', type = 'l', ylim = c(0, 3), xlab = "% Customers contacted", ylab = "Lift")
    lines(x = x, y = lift2 / baseline, col = 'blue')
    abline(a = baseline / baseline, b = 0, col = 'magenta')
    

    enter image description here