Search code examples
rlistdata.tablecumsum

Calculate the number of individual with more than i days of life with data.table in R


Here is my simplified data.table:

Individual time_alive (day)
ID1 1
ID2 5
ID3 7
ID4 5

I need to calculate the number of individual alive at every day. I achieved this by doing a loop

for (i in c(-1:600)) {
  y<-summarise(DT , time_alive > i )
  Alive[i+2,]<-length(y[y==TRUE])
}

However this is really long with a data.frame of more than 2B observations.

I wanted to try an alternative with data.table but I am stuck at only 1 day number of alive calculation:

DT[,.N,time_alive> i][time_alive==TRUE,2] 

Here, i cannot be replaced by a vector but only by 1 number. I want to calculate the number of individual with more than i days of life, without doing a loop.

My result expected for the simplified data would be:

Day Number of individual alive
1 4
2 3
3 3
4 3
5 3
6 1
7 1
8 0

Best solution in one line, with data.table which is much faster than looping:

DT[, .(Day = seq_len(1 + max(time_alive)))][DT[,.(time_alive)], .(.N), on = .(Day <= time_alive), by = Day]

Solution

  • # @r2evans suggestion about making it a one-liner
    # replaced res = data.table('day' = 1:max(DT$time_alive))
    DT[, .(day = seq_len(1 + max(time_alive)))][
         # my original solution
         DT, .(.N) ,on = .(day <= time_alive),by = day, allow.cartesian = T]
    
    # or 
    
    DT[,time_alive > TARGET_NUMBER, by = individual]
    

    I have two solutions based on what you have provided. One or both of them should be what you're looking for. See below for details/explanation

    # load in data
    DT = data.table('individual' = 1:4, 'time_alive' = c(1,5,7,5))
    # set your target number
    TARGET_NUMBER = 5
    
    # group by individual, 
    # then check if the number of days they were alive is greater than your target
    # this answers "i want to calculate the number of 
    # individual with more than "i" days of life
    
    DT[,time_alive > TARGET_NUMBER, by = individual]
    
    individual    V1
    1:          1 FALSE
    2:          2 FALSE
    3:          3  TRUE
    4:          4 FALSE
    
    # if the result you want is that table you created. that is a little different:
    # create a table with days ranging from 1 to the maximum survivor
    
    res = data.table('day' = 1:max(DT$time_alive))
    
    day
    1:   1
    2:   2
    3:   3
    4:   4
    5:   5
    6:   6
    7:   7
    
    
    # use joins
    # join by time alive being greater than or equal to the day
    # group by the specific day, and count how many observations we have
    # allow.cartesian because the mapping isn't one-to-one
    
    res[DT, .(.N) ,on = .(day <= time_alive),by = day, allow.cartesian = T]
    
    day N
    1:   1 4
    2:   2 3
    3:   3 3
    4:   4 3
    5:   5 3
    6:   6 1
    7:   7 1