Search code examples
kdb

how to use wj to bucket a list of timestamps in a table


I have a table that's aggregated by date, sym and client_type. The rest of the cols are not aggregated but rather lists.

So for example :

table2: select requestId, initialTimestamp by date, sym, client_type, from table

requestId and initialTimestamp are therefore lists.

I want to basically find duplicate data within 5 minutes and mark them as so.

if my 5 mins are fixed then I could do something like:

table3: update bucket:  (5 xbar\:`minute$initialTimestamp) from table2;

and then ungroup and work with the bucket to identify dupes... e.g

table4: select requestId by bucket, sym, client_type ungroup table3;
update duplicateId: ` from (ungroup update duplicateId: ?[(count each requestId)>1; requestId@'1; `] from table4) where requestId=duplicateId

This works nicely, but what If I wanted a 5 min rolling window instead of a fixed window?

it seems to point to a wj - but I'm not sure how to make it work with a grouped column.


Solution

  • If you just want to find duplicates you could possibly use fby rather than time buckets/windows?

    select from table where 1 < (count;initialTimestamp) fby ([]date;sym;client_type;requestId)
    

    Edit: I still think wj is unnecessary and won't be that efficient. You could do update by group like so:

    // given the following dummy table
    
    t:update time:"P"$"D" sv/: flip(string[date];string[time]) from ([]date:raze 20#'.z.d+til 5;sym:100?`symA`symB`symC;time:10:00+til 100;clientType:100?`clientA`clientB;requestId:100?3);
    
    update dup:00:05>=deltas time by date,sym,clientType,requestId from t
    

    Edit: Since you want the first duplicate when there are multiple then I think wj may be the only way as you originally wanted. You want to use wj1 as wj will consider the prevailing row before entering the time window.

    // sort by grouping then time
    t2:`date`sym`clientType`time xasc t; 
    times:-00:05 -00:00:00.000000001 +\:t2`time;
    
    // params
    // pair of time lists
    // common columns/grouping with time last
    // tableToJoinTo
    // (windowJoinTable;(function;col))
    
    wj1[times;`date`sym`clientType`time;t2;(update dupId:requestId from t2;(first;`dupId))]