I have a dataset which is right censored containing information of life times and different types of deaths for a given sample and I want to produce a plot of a survival curve (with the actual values that would be calculated from the sample and not from a model estimation) with the different types of death as a stacked area chart, something like this:
How can I accomplish this in R?
The dataset would look something like this:
death type time event
1 Type3 81 1
2 NA 868 0
3 Type3 1022 1
4 NA 868 0
5 NA 868 0
6 NA 868 0
7 NA 868 0
8 NA 887 0
9 Type3 156 1
10 NA 868 0
11 NA 868 0
12 NA 868 0
13 Type3 354 1
14 Type3 700 1
15 Type3 632 1
16 NA 868 0
17 Type1 308 1
18 NA 1001 0
19 NA 1054 0
20 NA 1059 0
21 Type3 120 1
22 NA 732 0
23 Type3 543 1
24 Type1 379 1
25 NA 613 0
26 NA 1082 0
27 Type3 226 1
28 Type2 1 0
29 NA 976 0
30 NA 1000 0
31 NA 706 0
32 NA 1015 0
33 Type3 882 1
34 NA 1088 0
35 NA 642 0
36 Type3 953 1
37 NA 1068 0
38 NA 819 0
39 NA 1029 0
40 Type3 34 1
41 NA 1082 0
42 Type3 498 1
43 NA 923 0
44 NA 1041 0
45 Type3 321 1
46 NA 557 0
47 NA 628 0
48 Type3 197 1
49 Type3 155 1
50 NA 955 0
Where death type with NA indicates censored data, time is the time of death or time of censoring, and event is 1 for those who are dead and 0 for those who are censored. (This is the format required by 'survfit' but I also have it with actual start and end times as dates)
(Now, with only 50 points it wouldn't be possible to construct such a curve, but the data has a lot more rows that wouldn't fit here).
It's an ugly bit of code, but it gets the idea in. I didn't take the time to figure out how to add the legend. Please also note that this kind of figure, while interesting in concept, isn't necessarily going to mirror a KM curve. To be honest, if you're going to present the data this way, it makes more sense to do it as stacked bars at fixed time points.
Please note, I'm pretty sure there are some flaws lurking in this code. It comes with no warranty, but might get you started.
SurvData <- structure(list(row.names = c("", "", "", "", "", "", "", "",
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
"", "", "", "", "", "", "", "", "", ""), death = 1:50, type = c("Type3",
NA, "Type3", NA, NA, NA, NA, NA, "Type3", NA, NA, NA, "Type3",
"Type3", "Type3", NA, "Type1", NA, NA, NA, "Type3", NA, "Type3",
"Type1", NA, NA, "Type3", "Type2", NA, NA, NA, NA, "Type3", NA,
NA, "Type3", NA, NA, NA, "Type3", NA, "Type3", NA, NA, "Type3",
NA, NA, "Type3", "Type3", NA), time = c(81L, 868L, 1022L, 868L,
868L, 868L, 868L, 887L, 156L, 868L, 868L, 868L, 354L, 700L, 632L,
868L, 308L, 1001L, 1054L, 1059L, 120L, 732L, 543L, 379L, 613L,
1082L, 226L, 1L, 976L, 1000L, 706L, 1015L, 882L, 1088L, 642L,
953L, 1068L, 819L, 1029L, 34L, 1082L, 498L, 923L, 1041L, 321L,
557L, 628L, 197L, 155L, 955L), event = c(1L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L,
0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L)), .Names = c("row.names",
"death", "type", "time", "event"), class = "data.frame", row.names = c(NA,
-50L))
library(dplyr)
library(zoo)
library(RColorBrewer)
SurvDataSummary <-
arrange(SurvData, time, type) %>%
mutate(type = ifelse(is.na(type), "Alive", type)) %>%
group_by(time) %>%
#* Count the number of each type at each time point
summarise(n_at_time = n(),
alive_at_time = sum(type == "Alive"),
type1_at_time = sum(type == "Type1"),
type2_at_time = sum(type == "Type2"),
type3_at_time = sum(type == "Type3")) %>%
ungroup() %>%
mutate(n_alive = sum(n_at_time) - cumsum(lag(n_at_time, default = 0)),
#* Proportion of each type
p_type1_at_time = type1_at_time / n_alive,
p_type2_at_time = type2_at_time / n_alive,
p_type3_at_time = type3_at_time / n_alive,
#* convert 0 to NA
p_type1_at_time = ifelse(p_type1_at_time == 0, NA, p_type1_at_time),
p_type2_at_time = ifelse(p_type2_at_time == 0, NA, p_type2_at_time),
p_type3_at_time = ifelse(p_type3_at_time == 0, NA, p_type3_at_time),
#* Back fill NAs with last known value
p_type1_at_time = na.locf(p_type1_at_time, FALSE),
p_type2_at_time = na.locf(p_type2_at_time, FALSE),
p_type3_at_time = na.locf(p_type3_at_time, FALSE),
#* make leading NAs 0
p_type1_at_time = ifelse(is.na(p_type1_at_time), 0, p_type1_at_time),
p_type2_at_time = ifelse(is.na(p_type2_at_time), 0, p_type2_at_time),
p_type3_at_time = ifelse(is.na(p_type3_at_time), 0, p_type3_at_time),
#* Calculate cumulative proportions
p_alive_at_time = 1 - p_type1_at_time - p_type2_at_time - p_type3_at_time,
cump_type1_at_time = p_alive_at_time + p_type1_at_time,
cump_type2_at_time = cump_type1_at_time + p_type2_at_time,
cump_type3_at_time = cump_type2_at_time + p_type3_at_time,
#* Get the following time for using geom_rect
next_time = lead(time)) %>%
pal <- brewer.pal(4, "PRGn")
ggplot(SurvDataSummary,
aes(xmin = time,
xmax = next_time)) +
geom_rect(aes(ymin = 0, ymax = p_alive_at_time), fill = pal[1]) +
geom_rect(aes(ymin = p_alive_at_time, ymax = cump_type1_at_time), fill = pal[2]) +
geom_rect(aes(ymin = cump_type1_at_time, ymax = cump_type2_at_time), fill = pal[3]) +
geom_rect(aes(ymin = cump_type2_at_time, ymax = cump_type3_at_time), fill = pal[4])