r, survival-analysis

Appropriate censoring and truncation for customer survival analysis


I am working on a typical customer survival analysis problem. I am analyzing customers who signed up between 2008-1-1 and 2018-1-1. Customers can register at any time during this interval and can cancel at any time, either within the interval or after the cut-off date of 2018-1-1.

Sample data is shown below. The first column is an identifier. The second column is the customer's status as of 2018-1-1 (1 for cancelled, 0 for not cancelled). The third column is the number of weeks between 2008-1-1 and the registration date. The last column is the number of weeks between 2008-1-1 and the cancellation date (if cancelled before 2018-1-1), or between 2008-1-1 and 2018-1-1 (if not cancelled, or cancelled after 2018-1-1).


dput() output to reproduce the dataset:

df <- structure(list(PrimaryConstituentSKey = c(1370591L, 1225587L, 
1264156L, 1266355L, 3080025L), Cancelled = c(1, 1, 1, 1, 0), 
startTime = c(0, 0, 0, 1, 101), stopTime = c(10, 34, 5, 9, 
123)), row.names = c(NA, -5L), class = "data.frame")

I will use this data to create a survival object, which will later be used as the response variable for my survival model.

If my assumption is right (the data is left truncated and right censored), is the code below the correct way to generate the survival object?

S <- Surv(time = df$startTime, time2 = df$stopTime, event = df$Cancelled)

model <- survfit(S ~ predictor1 + predictor2+.., data = df)

Question 2: I tried plotting the survival curves grouped by vendor to see how each vendor performs. Surprisingly, some vendors' curves start partway along the time axis, whereas I expected all of them to start from zero. When I checked the data, those vendors turned out to be comparatively new, having been around for only the past few years. To compare vendors properly, all curves should share the same starting point, which makes me suspect that my survival object is wrong. I would appreciate any help with this as well.

model <- survfit(S ~ Vendor, data = df)

ggsurvplot(fit = model, data = df, linetype = "strata") +
  xlab('duration in months') + ylab('retention rate')

[Image: survival curves by vendor, with some curves starting partway along the time axis]

Sorry for the lengthy questions. Thank you


Solution

  • After a bit of extra research and consulting with experts, I could sort out the issue.

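    My data is right censored (some customers are still active as of 2018-1-1 and may cancel at any time afterwards). What I had treated as left truncation is really staggered entry: customers can sign up at any time during the 10-year period, but each one is observed from their own registration date, so measuring time from registration removes the need for a separate entry time. The corrections below fixed the issue.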

    1. 'stopTime' should be the time elapsed between the registration date and either the cancellation date (if cancelled before 2018-1-1) or 2018-1-1 (if not cancelled, or cancelled after 2018-1-1).
    2. The 'Cancelled' status should be the status as of 2018-1-1.
    3. The code that creates the survival object should be changed as below:

        S <- Surv(time = df$stopTime, event = df$Cancelled, type = "right")
      
    4. As a best practice, create the survival object inside the model formula rather than as a separate variable, as below:

        model <- survfit(Surv(stopTime, Cancelled) ~ Vendor, df) 
      

      This produced a plot in which all the curves start at time 0.
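To see why the time scale matters, here is a minimal sketch that fits the question's five sample rows both ways. It assumes the `survival` package is installed. The column name `duration` and the object names `fit_calendar` and `fit_tenure` are introduced here for illustration only and are not part of the original code. On the calendar scale a customer joins the risk set only at their registration week, so a group made up of late registrants cannot show a curve before then; on the tenure scale everyone is at risk from week 0.

```r
library(survival)

# Sample rows from the question: weeks are counted from 2008-1-1.
df <- data.frame(
  PrimaryConstituentSKey = c(1370591L, 1225587L, 1264156L, 1266355L, 3080025L),
  Cancelled = c(1, 1, 1, 1, 0),
  startTime = c(0, 0, 0, 1, 101),
  stopTime  = c(10, 34, 5, 9, 123)
)

# Illustrative column (not in the original data): weeks since registration.
df$duration <- df$stopTime - df$startTime

# Calendar time scale: (start, stop] intervals, i.e. delayed entry.
fit_calendar <- survfit(Surv(startTime, stopTime, Cancelled) ~ 1, data = df)

# Tenure time scale: plain right censoring, every customer starts at 0.
fit_tenure <- survfit(Surv(duration, Cancelled) ~ 1, data = df)

# Compare the risk sets: the customer who registered at week 101 is
# missing from the early calendar-scale risk sets but present on the
# tenure scale.
print(data.frame(time = fit_calendar$time, n.risk = fit_calendar$n.risk))
print(data.frame(time = fit_tenure$time, n.risk = fit_tenure$n.risk,
                 surv = fit_tenure$surv))
```

Replacing `~ 1` with `~ Vendor` on the tenure-scale fit corresponds to the per-vendor comparison in step 4.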

    [Image: survival curves by vendor, all starting at time 0]
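Steps 1 and 2 can be sketched roughly as follows when starting from raw dates. This is only an illustration: the data frame `raw`, its column names `RegistrationDate` and `CancellationDate`, and the three example rows are all made up, since the original post does not show the date columns.

```r
# Hypothetical raw data: names and dates are invented for illustration.
raw <- data.frame(
  id = 1:3,
  RegistrationDate = as.Date(c("2008-01-01", "2010-06-15", "2017-03-01")),
  CancellationDate = as.Date(c("2008-03-11", "2019-02-01", NA))
)
cutoff <- as.Date("2018-01-01")

# Step 2: status as of the cut-off. A cancellation after the cut-off,
# or no cancellation at all, counts as "not cancelled" (censored).
raw$Cancelled <- as.integer(!is.na(raw$CancellationDate) &
                              raw$CancellationDate < cutoff)

# End of follow-up: the cancellation date for observed cancellations,
# otherwise the cut-off date.
end_date <- raw$CancellationDate
end_date[raw$Cancelled == 0] <- cutoff

# Step 1: stopTime as weeks elapsed since each customer's own registration.
raw$stopTime <- as.numeric(end_date - raw$RegistrationDate) / 7

print(raw[, c("id", "stopTime", "Cancelled")])
```

The resulting `stopTime` and `Cancelled` columns have the shape expected by `Surv(stopTime, Cancelled)` in steps 3 and 4.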