Search code examples
rperformancedata.tablebigdatacalculated-columns

Populating a variable based on previous values and another variable in a data.table


I have order and trades data and need to calculate the remaining volume in an order after trades have happened on it. On the data below, we have volumes against orders, and trades happen when status==2.

      Order Status Volume Traded RemainingVolume
 1:   412      1    100     NA              100
 2:   412      2     NA     46               NA
 3:   412      2     NA     15               NA
 4:   412      2     NA     39               NA
 5:   538      1     10     NA               10
 6:   538      2     NA      7               NA
 7:   538      2     NA      3               NA
 8:   592      1    389     NA              389
 9:   592      2     NA     95               NA
10:   648      1    100     NA              100
11:   648      2     NA    100               NA
12:   885      1     50     NA               50
13:   885      2     NA     47               NA
14:   885      2     NA      3               NA
15:   950      1     39     NA               39
16:   950      2     NA     39               NA

I just need to calculate the remaining volume for every order. This is calculated by subtracting the Traded (volume) from the Volume. Since the RemainingVolume at the beginning (status==1) is anyway set to Volume, this is simply (RemainingVolume-Traded), that s all. In other words, I need the following output:

     Order Status Volume Traded RemainingVolume
 1:   412      1    100     NA              100
 2:   412      2     NA     46               54
 3:   412      2     NA     15               39
 4:   412      2     NA     39                0
 5:   538      1     10     NA               10
 6:   538      2     NA      7                3
 7:   538      2     NA      3                0
 8:   592      1    389     NA              389
 9:   592      2     NA     95              294
10:   648      1    100     NA              100
11:   648      2     NA    100                0
12:   885      1     50     NA               50
13:   885      2     NA     47                3
14:   885      2     NA      3                0
15:   950      1     39     NA               39
16:   950      2     NA     39                0

Notice how remaining volume has been filled for orders 412, 538, 592, 648, 885 and 950. Assuming the data is in a table mz, rem is RemainingVolume, trdq is Traded, I tried the following:

for (i in 2:nrow(mz)){
  if (is.na(mz[i]$rem))mz[i]$rem = mz[i-1]$rem - mz[i]$trdq
}

Works, but is really, really slow. The data here has many millions of rows, hence only data.table solution is feasible. So I tried:

mz[,rem:= ifelse(is.na(rem), shift(rem, 1)-trdq, rem), by = ord]

No answer here as well. I get the RemainingVolume for the first trade, but that's it. The trades below remain NA. What am I missing here? I'm new to data.table, and so something very simple, may be.

Performance is truly key here, given the size of the data. Any help is much appreciated.


Solution

  • baseR way of doing it

    dt$RemainingVolume <- ave(replace(dt$RemainingVolume, dt$Status ==2, -1*dt$Traded[dt$Status ==2]), dt$Order, FUN = function(x) cumsum(x))
    
    dt
        Order Status Volume Traded RemainingVolume
     1:   412      1    100     NA             100
     2:   412      2     NA     46              54
     3:   412      2     NA     15              39
     4:   412      2     NA     39               0
     5:   538      1     10     NA              10
     6:   538      2     NA      7               3
     7:   538      2     NA      3               0
     8:   592      1    389     NA             389
     9:   592      2     NA     95             294
    10:   648      1    100     NA             100
    11:   648      2     NA    100               0
    12:   885      1     50     NA              50
    13:   885      2     NA     47               3
    14:   885      2     NA      3               0
    15:   950      1     39     NA              39
    16:   950      2     NA     39               0
    

    sample data used

    dt <- structure(list(Order = c(412L, 412L, 412L, 412L, 538L, 538L, 
    538L, 592L, 592L, 648L, 648L, 885L, 885L, 885L, 950L, 950L), 
        Status = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 
        2L, 2L, 1L, 2L), Volume = c(100L, NA, NA, NA, 10L, NA, NA, 
        389L, NA, 100L, NA, 50L, NA, NA, 39L, NA), Traded = c(NA, 
        46L, 15L, 39L, NA, 7L, 3L, NA, 95L, NA, 100L, NA, 47L, 3L, 
        NA, 39L), RemainingVolume = c(100, 54, 39, 0, 10, 3, 0, 389, 
        294, 100, 0, 50, 3, 0, 39, 0)), row.names = c(NA, -16L), class = "data.frame")