Search code examples
kdb

Classifying each row based on a set number of intervals per day in KDB?


Given the following table:

Initial Table:

shop | time    
-----------
  A  | 1000
  A  | 1100
  B  | 1130
  B  | 1131
  C  | 1132
  A  | 1133
  A  | 1134
  B  | 1230
  C  | 1232
  C  | 1400

Resultant Table:

   shop | time |  mark  Number of times a shop has appeared within an interval.
   --------------------
 1)  A  | 1000 |   0    [A = 1]
 2)  A  | 1100 |   0    [A = 2]
 3)  B  | 1130 |   0    [A = 2, B = 1]
 4)  B  | 1131 |   0    [A = 2, B = 2]
 5)  C  | 1132 |   0    [A = 2, B = 2, C = 1]
 6)  A  | 1133 |   0    [A = 2, B = 2, C = 1]
 7)  A  | 1134 |   1    [A = 3 (Mark cell), B = 2, C = 1]
 8)  B  | 1230 |   1    [A = 1, B = 3 (Mark cell), C = 1]
 9)  C  | 1232 |   0    [A = 1, B = 1, C = 2]
10)  C  | 1400 |   0    [A = 1, B = 1, C = 0]

Where:

  • t = 1 hour.

  • n = 3.

For a fixed interval t, if there are n or more transactions within t, mark the row as true 1. Else mark as 0.

Explanation:

  • Lines 1 to 2, A makes two transaction within t.

  • Lines 3 to 4, B makes two transaction within t.

  • Line 5, C has its first transaction.

  • Line 6, A makes another transaction but the cell is not marked as the - interval between this and the first transaction exceeds t (1000 -> 1133).

  • Line 7, row is marked as A has n transaction within t (1100 -> 1133 -> 1134).

  • Line 8, row is marked as B has n transaction within t (1130 -> 1131 -> 1230)

  • Line 9 and 10, C makes two transactions but is unmarked as the interval exceeds t (1132 -> 1232 ---> 1400)

Additionally, this will be refreshed on a per day basis (A column DealDate is included in the format YYYYMMDD)

Essentially, this emulated a queue for each shop whereby each item pushed to the stack will be marked depending on the number of items in the queue, with each last item exceeding the interval popped.

How can I accomplish this in KDB with just Q? The dates and timestamps are ordered in descending order.


Solution

  • You can use:

    q)table:([]shop:`A`A`B`B`C`A`A`B`C`C; time:1000 1100 1130 1131 1132 1133 1134 1230 1232 1400)
    q)t:100
    q)n:3
    q)update mark:t>=(t+1)^time-(n-1)xprev time by shop from table
    shop time mark
    --------------
    A    1000 0
    A    1100 0
    B    1130 0
    B    1131 0
    C    1132 0
    A    1133 0
    A    1134 1
    B    1230 1
    C    1232 0
    C    1400 0
    

    This calculates the difference in time between each cell and the cell that occurred two rows behind it (time-(n-1)xprev time) for that shop (by shop).

    It then fills the nulls with a value greater than t as we don't want to include those cells ((t+1)^).

    Then it checks whether the earliest time of the three is within 1hr of the current time, assigning 1 where this is true (t>=).

    This can also be refreshed for each date by including by dealDate:

    table:([]dealDate:(10#20190704),10#20190705;shop:20#`A`A`B`B`C`A`A`B`C`C; time:20#1000 1100 1130 1131 1132 1133 1134 1230 1232 1400)
    q)update mark:t>=(t+1)^time-(n-1)xprev time by dealDate,shop from table
    dealDate shop time mark
    -----------------------
    20190704 A    1000 0
    20190704 A    1100 0
    20190704 B    1130 0
    20190704 B    1131 0
    20190704 C    1132 0
    20190704 A    1133 0
    20190704 A    1134 1
    20190704 B    1230 1
    20190704 C    1232 0
    20190704 C    1400 0
    20190705 A    1000 0
    20190705 A    1100 0
    20190705 B    1130 0
    20190705 B    1131 0
    20190705 C    1132 0
    20190705 A    1133 0
    20190705 A    1134 1
    20190705 B    1230 1
    20190705 C    1232 0
    20190705 C    1400 0
    

    You may wish to have your dealDate column in a date format as opposed to YYYYMMDD, as YYYYMMDD would be a long. For example you can change 20190705 by casting it to the data 2019.07.05 using "D"$string 20190705