I have binned data that I'm trying to run a survival analysis on; example data is below. n is the count of units at each combination of group, time, and failure indicator.
> df <- structure(list(group = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("", "A", "B"), class = "factor"), t = c(0L, 1L, 2L, 3L, 1L, 2L, 3L, 0L, 1L, 2L, 3L, 1L, 2L, 3L), failure = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), n = c(40000L, 30000L, 20000L, 10000L, 5L, 4L, 3L, 20000L, 15000L, 14000L, 11000L, 10L, 6L, 4L)), .Names = c("group", "t", "failure", "n"), row.names = c(NA, 14L), class = "data.frame")
> df
group t failure n
1 A 0 0 40000
2 A 1 0 30000
3 A 2 0 20000
4 A 3 0 10000
5 A 1 1 5
6 A 2 1 4
7 A 3 1 3
8 B 0 0 20000
9 B 1 0 15000
10 B 2 0 14000
11 B 3 0 11000
12 B 1 1 10
13 B 2 1 6
14 B 3 1 4
I know I can rep df by the n column so that each row is one unit (ref. How do I create a survival object in R?):
> library(survival)
> df2 <- df[rep(rownames(df),df$n),]
> sfit <- survfit(Surv(t,failure)~group, data = df2)
However, my actual data has about 10 million units. Is there a way to run the survival analysis with a count/frequency variable, so that I can avoid creating a 10-million-row data frame?
You'll want to use the weights parameter of survfit. You can compare the two approaches to confirm that they give the same output.
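As a rough illustration of why a frequency weight is enough: the Kaplan-Meier estimate only needs, at each event time, the number of units at risk and the number of events, and both are sums over the n column. The sketch below computes them by hand in base R. The helper name km_from_counts is made up for this example, and the data frame is rebuilt with data.frame() so the block runs on its own.

```r
# Example data from the question, rebuilt so this block is self-contained
df <- data.frame(
  group   = factor(rep(c("A", "B"), each = 7)),
  t       = rep(c(0L, 1L, 2L, 3L, 1L, 2L, 3L), times = 2),
  failure = rep(c(0L, 0L, 0L, 0L, 1L, 1L, 1L), times = 2),
  n       = c(40000L, 30000L, 20000L, 10000L, 5L, 4L, 3L,
              20000L, 15000L, 14000L, 11000L, 10L, 6L, 4L)
)

# Hypothetical helper: Kaplan-Meier estimate straight from binned counts
km_from_counts <- function(d) {
  # distinct times at which at least one failure occurred
  times <- sort(unique(d$t[d$failure == 1]))
  # units still under observation at each event time (t >= that time)
  n_risk <- sapply(times, function(u) sum(d$n[d$t >= u]))
  # failures recorded exactly at each event time
  n_event <- sapply(times, function(u) sum(d$n[d$t == u & d$failure == 1]))
  data.frame(time = times, n.risk = n_risk, n.event = n_event,
             survival = cumprod(1 - n_event / n_risk))
}

# One table per group, without ever expanding the rows
km <- lapply(split(df, df$group), km_from_counts)
km$A
km$B
```

The n.risk and n.event columns this produces should line up with the survfit output further down, which is what the weights argument does for you internally.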
With your data that you repeated:
sfit <- survfit(Surv(t,failure)~group, data = df2)
summary(sfit)
Call: survfit(formula = Surv(t, failure) ~ group, data = df2)
group=A
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 60012 5 1.000 3.73e-05 1.000 1
2 30007 4 1.000 7.63e-05 1.000 1
3 10003 3 0.999 1.89e-04 0.999 1
group=B
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 40020 10 1.000 0.000079 1.000 1
2 25010 6 1.000 0.000126 0.999 1
3 11004 4 0.999 0.000221 0.999 1
Now using weights on the original, unexpanded data:
weights <- df$n
sfit2 <- survfit(Surv(t,failure)~group, data = df, weights = weights)
summary(sfit2)
Call: survfit(formula = Surv(t, failure) ~ group, data = df, weights = weights)
group=A
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 60012 5 1.000 3.73e-05 1.000 1
2 30007 4 1.000 7.63e-05 1.000 1
3 10003 3 0.999 1.89e-04 0.999 1
group=B
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 40020 10 1.000 0.000079 1.000 1
2 25010 6 1.000 0.000126 0.999 1
3 11004 4 0.999 0.000221 0.999 1
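If you would rather confirm the match programmatically than by eye, something like the following could work. It is only a sketch: the names s_expanded and s_weighted are made up here, the example data is rebuilt with data.frame() so the block runs on its own, and the weights are passed as the n column directly instead of through a separate vector.

```r
library(survival)

# Example data from the question, rebuilt so this block is self-contained
df <- data.frame(
  group   = factor(rep(c("A", "B"), each = 7)),
  t       = rep(c(0L, 1L, 2L, 3L, 1L, 2L, 3L), times = 2),
  failure = rep(c(0L, 0L, 0L, 0L, 1L, 1L, 1L), times = 2),
  n       = c(40000L, 30000L, 20000L, 10000L, 5L, 4L, 3L,
              20000L, 15000L, 14000L, 11000L, 10L, 6L, 4L)
)

# Approach 1: one row per unit (feasible here, not at 10 million units)
df2 <- df[rep(seq_len(nrow(df)), df$n), ]
s_expanded <- summary(survfit(Surv(t, failure) ~ group, data = df2))

# Approach 2: keep the binned rows and use n as a frequency weight
s_weighted <- summary(survfit(Surv(t, failure) ~ group, data = df, weights = n))

# Compare the pieces that appear in the printed tables
all.equal(s_expanded$n.risk,  s_weighted$n.risk)
all.equal(s_expanded$n.event, s_weighted$n.event)
all.equal(s_expanded$surv,    s_weighted$surv)
```

all.equal() compares with a small numeric tolerance, which is usually what you want for floating-point survival estimates.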