Here is my simplified data.table:
Individual | time_alive (day) |
---|---|
ID1 | 1 |
ID2 | 5 |
ID3 | 7 |
ID4 | 5 |
I need to calculate the number of individual alive at every day. I achieved this by doing a loop
for (i in c(-1:600)) {
y<-summarise(DT , time_alive > i )
Alive[i+2,]<-length(y[y==TRUE])
}
However this is really long with a data.frame of more than 2B observations.
I wanted to try an alternative with data.table but I am stuck at only 1 day number of alive calculation:
DT[,.N,time_alive> i][time_alive==TRUE,2]
Here, i cannot be replaced by a vector but only by 1 number. I want to calculate the number of individual with more than i days of life, without doing a loop.
My result expected for the simplified data would be:
Day | Number of individual alive |
---|---|
1 | 4 |
2 | 3 |
3 | 3 |
4 | 3 |
5 | 3 |
6 | 1 |
7 | 1 |
8 | 0 |
Best solution in one line, with data.table which is much faster than looping:
DT[, .(Day = seq_len(1 + max(time_alive)))][DT[,.(time_alive)], .(.N), on = .(Day <= time_alive), by = Day]
# @r2evans suggestion about making it a one-liner
# replaced res = data.table('day' = 1:max(DT$time_alive))
DT[, .(day = seq_len(1 + max(time_alive)))][
# my original solution
DT, .(.N) ,on = .(day <= time_alive),by = day, allow.cartesian = T]
# or
DT[,time_alive > TARGET_NUMBER, by = individual]
I have two solutions based on what you have provided. One or both of them should be what you're looking for. See below for details/explanation
# load in data
DT = data.table('individual' = 1:4, 'time_alive' = c(1,5,7,5))
# set your target number
TARGET_NUMBER = 5
# group by individual,
# then check if the number of days they were alive is greater than your target
# this answers "i want to calculate the number of
# individual with more than "i" days of life
DT[,time_alive > TARGET_NUMBER, by = individual]
individual V1
1: 1 FALSE
2: 2 FALSE
3: 3 TRUE
4: 4 FALSE
# if the result you want is that table you created. that is a little different:
# create a table with days ranging from 1 to the maximum survivor
res = data.table('day' = 1:max(DT$time_alive))
day
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
# use joins
# join by time alive being greater than or equal to the day
# group by the specific day, and count how many observations we have
# allow.cartesian because the mapping isn't one-to-one
res[DT, .(.N) ,on = .(day <= time_alive),by = day, allow.cartesian = T]
day N
1: 1 4
2: 2 3
3: 3 3
4: 4 3
5: 5 3
6: 6 1
7: 7 1