I have been using the survival package in R to generate Kaplan-Meier plots using tables of mortality data. I am now trying to generate the same type of plots with new data that is in a different format. I would like to convert the data so it is in the same format as the input data I have been using to succesfully make Kaplan-Meier plots.
The data I typically has the following format:
ID Entry_type Departure_type Birth_date Death_date Lifespan
234 B 1 2008-05-01 2020-01-20 10.986995
549 B 1 2014-05-25 2016-02-09 1.711157
235 B 1 2015-02-01 2017-08-01 2.496920ID
This data has one individual per row, specified by their ID. The final column is that individual's lifespan
I can generate a Kaplan-Meier plot from this table using the code:
survival_plot <- ggsurvplot(
fit = survfit(Surv(Lifespan, Depart.Type) ~ 1, data = life_span_table),
xlab = "Years",
ylab = "Overall survival probability",
surv.median.line = "hv",
title = "Survival Curve",
legend.title="")
However I now have census record data where I am tracking a cohort born in the year 1905. The data looks like this:
Year Age Total BirthYear CumDeath
1905 0 9262 1905 9262
1906 1 1335 1905 10597
1907 2 514 1905 11111
Explainer for the columns:
Year: The year being recorded.
Age: The age of the individual from the 1905 cohort in that year.
Total: The number of individuals born in 1905 who died in the recorded year.
BirthYear: Always 1905 for this cohort (redundant information)
CumDeath: The total number of people from the cohort who have died at the point of recording (cumulative death).
So this data format does not have an individual per row as previously. I feel the only way to make identical Kaplan-Meier plots using this data is to reformat it so that there is one individual per row and their Lifespan data is added as a new column to fit the format I usually work with.
Lifespan can be calculated as the difference between the Year and the Birthyear, but I would need to make the number of rows for each lifespan equal to the Total entry for each year. Then I can add a Entry_Type and Departure_Type column for each row that would always be 'B' for Entry_type and '1' for Departure_type.
I usually use dplyr for reformatting tables but I am not sure of the best way to add new rows for each lifespan conditional on the Total entry. Can dplyr do this or am I better off using a loop or lapply function?
Any help is greatly appreciated.
It turns out to be a rather easy exercise using R data.frame row indexing. Consider this toy example (which is pretty similar to yours):
tt = data.frame(name=c('a', 'b', 'c'),
count=c(1, 2, 3))
# extract counts specifying how many times to replicate each row
counts = tt$count
# construct row indexes
row_positions = seq_along(counts)
# or
row_positions = seq(1, nrow(tt))
# inflate vector of row indexes according to the counts
row_pos_replicated = rep(row_positions, counts)
# use inflated indexes to expand frame rows per `count` values
tt_replicated = tt[row_pos_replicated,]
We just expanded rows from tt
into new frame tt_replicated
according to the count
column:
> tt_replicated
name count
1 a 1
2 b 2
2.1 b 2
3 c 3
3.1 c 3
3.2 c 3
Solution for your frame based on above and using more condensed syntax:
census_data = read.csv(text = "Year Age Total BirthYear CumDeath
1905 0 9262 1905 9262
1906 1 1335 1905 10597
1907 2 514 1905 11111", sep="")
census_data_for_KMplot = census_data[rep(seq(1,
nrow(census_data)),
census_data$CumDeath),]
Let's count number of rows for each Year
(using data.table
):
> data.table::setDT(census_data_for_KMplot)[, .N, by=Year]
Year N
1: 1905 9262
2: 1906 10597
3: 1907 11111
Using data.table
for speed and somewhat simpler syntax (notice no sep=
in fread
and no ,
inside []
while row indexing):
library(data.table)
census_data = fread(text = "Year Age Total BirthYear CumDeath
1905 0 9262 1905 9262
1906 1 1335 1905 10597
1907 2 514 1905 11111")
census_data_for_KMplot = census_data[rep(seq_along(census_data$CumDeath),
census_data$CumDeath)]
census_data_for_KMplot[, .N, by=Year]
Year N
1: 1905 9262
2: 1906 10597
3: 1907 11111