Search code examples
joindata.tablenon-equi-join

Contingent data.table sum


I have a data.table representing three variables:

  1. Start_time (commencement of a certain procedure)
  2. Time (the actual time)
  3. Value (some value ascribed to each procedure at a given time)

I've formed a dummy set of data to represent this below:

libarary(data.table)
library(dplyr)

Start_number <- 3
Duration <- 4

Table <- rbindlist(lapply(seq_len(Start_number),function(x) data.table(Start_time = x,
                Time = as.integer(x + seq_len(Duration) - 1)) %>% mutate(Value = 2)))

(I've just put the "Value" column to be '2', however this will vary in my actual data set)

I need to create a table which has the following two columns of output:

  1. Time - each value from smallest to largest
  2. Total_FV - the sum of the 'Value' column subject to the following conditions:

Condition 1: We sum up the 'Value' column for rows where 'Time' > current row's value of 'Time' (e.g. if 'Time' = 1, we'd only sum up rows where 'Time' = 2, 3, 4, ...)

Condition 2: Where Condition 1 is met, we exclude summing up rows where 'Start_time' > 'Time'.

I'll explain what the intended output is to hopefully make more sense of the above.

Target_output <- data.table(Time = seq_len(Duration),Total_FV = c(6,10,12,6,2,0))

Row 1 of output: Need to sum up for all values of Time > 1, but only where Start_time <= 1. From the "Table" data.table, this gives us rows 2, 3 and 4 which sum up to 6.

Row 2 of output: Need to sum up for all values of Time > 2, but only where Start_time <= 2. This gives us rows 3 and 4 where Start_time = 1, and then rows 6 - 8 where Start_time = 2. In total, this gives us 10.

Row 3 of output: Need to sum up for all values of Time > 3, but only where Start_time <= 3. This gives us row 4 where Start_time = 1, rows 7 - 8 where Start_time = 2, and then rows 10 - 12 for Start_time = 3. In total, this gives us 12.

...and so on

I tried to achieve this by conducting using non-equi joins with 'Table' on itself as follows:

Result <- Table[Table,on = .(Start_time >= Time,Time > Time),.(Total_FV = sum(Value)),by = Time]

Unfortunately this looks to be a long way off.

Can someone please see if they have a data.table-like approach to this, presumably using joins?

Thanks!


Solution

  • This approach doesn't use joins, but might work for your needs.

    ts = Table[, unique(Time)]
    data.table(Time=ts, Total_FV = sapply(ts, \(t) Table[Time>t & Start_time<=t, sum(Value)]))
    

    Output:

        Time Total_FV
       <int>    <num>
    1:     1        6
    2:     2       10
    3:     3       12
    4:     4        6
    5:     5        2
    6:     6        0