Search code examples
rdplyrdatatablesurvival-analysissurvival

How to convert a mortality table to generate a Kaplan-Meier plot


I have been using the survival package in R to generate Kaplan-Meier plots using tables of mortality data. I am now trying to generate the same type of plots with new data that is in a different format. I would like to convert the data so it is in the same format as the input data I have been using to succesfully make Kaplan-Meier plots.

The data I typically has the following format:

ID Entry_type Departure_type Birth_date Death_date  Lifespan
234 B 1 2008-05-01 2020-01-20 10.986995
549 B 1 2014-05-25 2016-02-09 1.711157
235 B 1 2015-02-01 2017-08-01 2.496920ID

This data has one individual per row, specified by their ID. The final column is that individual's lifespan

I can generate a Kaplan-Meier plot from this table using the code:

survival_plot <- ggsurvplot(
    fit = survfit(Surv(Lifespan, Depart.Type) ~ 1, data = life_span_table), 
    xlab = "Years", 
    ylab = "Overall survival probability",
    surv.median.line = "hv",
    title = "Survival Curve",
    legend.title="")

However I now have census record data where I am tracking a cohort born in the year 1905. The data looks like this:

Year Age Total BirthYear CumDeath
1905  0 9262 1905 9262
1906 1 1335 1905 10597
1907 2 514 1905 11111

Explainer for the columns:

Year: The year being recorded.
Age: The age of the individual from the 1905 cohort in that year.
Total: The number of individuals born in 1905 who died in the recorded year.
BirthYear: Always 1905 for this cohort (redundant information)
CumDeath: The total number of people from the cohort who have died at the point of recording (cumulative death). 

So this data format does not have an individual per row as previously. I feel the only way to make identical Kaplan-Meier plots using this data is to reformat it so that there is one individual per row and their Lifespan data is added as a new column to fit the format I usually work with.

Lifespan can be calculated as the difference between the Year and the Birthyear, but I would need to make the number of rows for each lifespan equal to the Total entry for each year. Then I can add a Entry_Type and Departure_Type column for each row that would always be 'B' for Entry_type and '1' for Departure_type.

I usually use dplyr for reformatting tables but I am not sure of the best way to add new rows for each lifespan conditional on the Total entry. Can dplyr do this or am I better off using a loop or lapply function?

Any help is greatly appreciated.


Solution

  • It turns out to be a rather easy exercise using R data.frame row indexing. Consider this toy example (which is pretty similar to yours):

    tt = data.frame(name=c('a', 'b', 'c'), 
                    count=c(1, 2, 3))
    # extract counts specifying how many times to replicate each row
    counts = tt$count
    # construct row indexes
    row_positions = seq_along(counts)
    # or
    row_positions = seq(1, nrow(tt))
    # inflate vector of row indexes according to the counts
    row_pos_replicated = rep(row_positions, counts)
    # use inflated indexes to expand frame rows per `count` values
    tt_replicated = tt[row_pos_replicated,]
    

    We just expanded rows from tt into new frame tt_replicated according to the count column:

    > tt_replicated
           name  count
    1      a     1
    2      b     2
    2.1    b     2
    3      c     3
    3.1    c     3
    3.2    c     3
    

    Solution for your frame based on above and using more condensed syntax:

    census_data = read.csv(text = "Year Age Total BirthYear CumDeath
    1905 0 9262 1905 9262
    1906 1 1335 1905 10597
    1907 2 514 1905 11111", sep="")
    
    census_data_for_KMplot = census_data[rep(seq(1,
                                                 nrow(census_data)),
                                             census_data$CumDeath),]
    

    Let's count number of rows for each Year (using data.table):

    > data.table::setDT(census_data_for_KMplot)[, .N, by=Year]
       Year     N
    1: 1905  9262
    2: 1906 10597
    3: 1907 11111
    

    Using data.table for speed and somewhat simpler syntax (notice no sep= in fread and no , inside [] while row indexing):

    library(data.table)
    
    census_data = fread(text = "Year Age Total BirthYear CumDeath
    1905 0 9262 1905 9262
    1906 1 1335 1905 10597
    1907 2 514 1905 11111")
    
    census_data_for_KMplot = census_data[rep(seq_along(census_data$CumDeath), 
                                         census_data$CumDeath)]
    census_data_for_KMplot[, .N, by=Year]
    
       Year     N
    1: 1905  9262
    2: 1906 10597
    3: 1907 11111