I am working on a regular customer survival analysis problem, analyzing customers who signed up between 2008-1-1 and 2018-1-1. Customers can register at any time during this interval and can exit at any time during it or after the cut-off date of 2018-1-1.
Sample data is shown below. The first column is an identifier. The second column is the customer's status as of 2018-1-1: 1 for cancelled and 0 for not cancelled. The third column is the number of weeks between 2008-1-1 and their registration date. The last column is the number of weeks between 2008-1-1 and their cancellation date (if they cancelled before 2018-1-1), or the number of weeks between 2008-1-1 and 2018-1-1 (if they have not cancelled, or cancelled after 2018-1-1).
dput() output to reproduce the dataset:
structure(list(PrimaryConstituentSKey = c(1370591L, 1225587L,
1264156L, 1266355L, 3080025L), Cancelled = c(1, 1, 1, 1, 0),
startTime = c(0, 0, 0, 1, 101), stopTime = c(10, 34, 5, 9,
123)), row.names = c(NA, -5L), class = "data.frame")
I will use this data to create a 'Survival object', which will later be used as the response variable for my survival model.
Theoretical questions
I asked these on Cross Validated but have not had a response yet (https://stats.stackexchange.com/questions/423802/appropriate-censoring-and-truncation-for-customer-survival-analysis). Does this approach make sense? In particular, what sort of censoring/truncation is suitable in this scenario? I believe the data is left truncated (as people can join at any time after 2008-1-1) and right censored (some of them will have left sometime after 2018-1-1).
Coding questions:
Question 1: If my assumption is right (the data is left truncated and right censored), is the code below correct for generating the survival object?
S <- Surv(time = df$startTime, time2 = df$stopTime, event = df$Cancelled)
model <- survfit(S ~ predictor1 + predictor2+.., data = df)
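For reference, here is a minimal sketch of what that call produces on the five sample rows above. It assumes only the `survival` package and rebuilds `df` from the dput() output; the intercept-only formula `S ~ 1` is a stand-in for the real predictors, which are not shown in the sample.

```r
library(survival)

# Sample rows copied from the dput() output above
df <- data.frame(
  PrimaryConstituentSKey = c(1370591L, 1225587L, 1264156L, 1266355L, 3080025L),
  Cancelled = c(1, 1, 1, 1, 0),
  startTime = c(0, 0, 0, 1, 101),
  stopTime  = c(10, 34, 5, 9, 123)
)

# Supplying both time and time2 makes Surv() build a (start, stop] object
S <- Surv(time = df$startTime, time2 = df$stopTime, event = df$Cancelled)
surv_type <- attr(S, "type")
print(surv_type)
print(S)

# Intercept-only fit (assumption: no predictors are available in the sample)
fit <- survfit(S ~ 1, data = df)
print(fit)
```

With this form each customer enters the risk set at startTime rather than at 0, and the time axis is weeks since 2008-1-1, not weeks since sign-up.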
Question 2: I tried plotting the survival curves grouped by vendor to see how each vendor performs. Surprisingly, the curves for some vendors start somewhere around the middle of the time axis, whereas I was expecting all of them to start from zero. When I checked the data, those vendors are comparatively new and have only been in the picture for the past few years. To compare them properly, all of the curves should have the same starting point, and this makes me suspect that my survival object is wrong. I would appreciate it if someone could help me with this as well.
model <- survfit(S ~ Vendor, data = df)
ggsurvplot(fit = model, data = df, linetype = "strata") + xlab('duration in months') + ylab('retention rate')
Sorry for the lengthy questions. Thank you.
After a bit of extra research and consulting with experts, I was able to sort out the issue.
My data is indeed left truncated (as customers can sign up at any time during the 10-year period) and right censored (some customers are still active as of 2018-1-1 and may cancel at any time afterwards). The corrections below fixed the issue.
The code to create the survival object should be modified as below:
S <- Surv(time = df$stopTime, event = df$Cancelled, type = "right")
As a best practice, it is advisable to create the survival object at the same time as the model is defined, as below:
model <- survfit(Surv(stopTime, Cancelled) ~ Vendor, df)
This produced a plot in which all the curves start at time 0.
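One thing worth checking, offered as an assumption rather than as part of the fix above: stopTime is measured from 2008-1-1, not from each customer's own sign-up date. If the time scale of interest is time since registration, a minimal sketch would compute the duration first. The `tenure` column below is a hypothetical name, not something in the original data, and the intercept-only formula stands in for `~ Vendor` because the sample rows have no Vendor column.

```r
library(survival)

# Sample rows copied from the dput() output in the question
df <- data.frame(
  PrimaryConstituentSKey = c(1370591L, 1225587L, 1264156L, 1266355L, 3080025L),
  Cancelled = c(1, 1, 1, 1, 0),
  startTime = c(0, 0, 0, 1, 101),
  stopTime  = c(10, 34, 5, 9, 123)
)

# Hypothetical column: weeks between sign-up and cancellation (or cut-off)
df$tenure <- df$stopTime - df$startTime
print(df$tenure)

# Right-censored survival object on the tenure scale, defined inside the model call
fit <- survfit(Surv(tenure, Cancelled) ~ 1, data = df)
print(summary(fit))
```

On this scale every customer starts at 0 by construction, so the curves share an origin, and the last sample customer contributes 22 weeks of follow-up rather than 123.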