I need to reshape my data, to get it in a proper format for Survival Analysis.
My current Dataset looks like this:
Product_Number Date Status
A 2018-01-01 0
A 2018-01-02 1
A 2018-01-03 0
B 2018-01-01 0
B 2018-01-02 0
B 2018-01-03 0
B 2018-01-04 1
C 2018-01-01 0
C 2018-01-02 0
I need to reshape my data, based on the columns Product_Number, Date and Status (I want to count the number of days, per product, until the status shift to a 1. If the status is 0, the proces should start over again).
So the data should look like this:
Product_Number Number_of_Days Status
A 2 1 #Two days til status = 1
A 1 0 #One day, status = 0 (no end date yet)
B 4 1 #Four days til status = 1
C 2 0 #Two days, status is still 0 (no end date yet)
What have I tried so far?
I ordered my data by ProductNumber and Date. I love the DPLYR way, so I used:
df <- df %>% group_by(Product_Number, Date) # note: my data is now in the form as in the example above.
Then I tried to use the diff() function, to see the differences in dates (count the number of days). But I was unable to "stop" the count, when status switched (from 0 to 1, and vice versa).
I hope that I clearly explained the problem. Please let me know if you need some additional information.
You could do:
library(dplyr)
df %>%
group_by(Product_Number) %>%
mutate(Date = as.Date(Date),
group = cumsum(coalesce(as.numeric(lag(Status) == 1 & Status == 0), 1))) %>%
group_by(Product_Number, group) %>%
mutate(Number_of_Days = (last(Date) - first(Date)) + 1) %>%
slice(n()) %>% ungroup() %>%
select(-group, -Date)
Output:
# A tibble: 4 x 3
Product_Number Status Number_of_Days
<chr> <int> <time>
1 A 1 2
2 A 0 1
3 B 1 4
4 C 0 2