Search code examples
kdb

KDB/Q query for rows matching a contiguous criteria?


I'm looking to find transactions (contiguous?) in a data set of shops which follows the trend that they eventually complete a transaction despite a few prior cancellations in a day.

A valid batch transaction must meet a set of criteria.

  1. They should be from the same shop.
  2. They should eventually be completed i.e. X amount of cancellation but 1 completion.
  3. The pending batch transactions (cancelled and completion) should not exceed a certain time frame, for example, 1 day.
  4. The transactions should have the same amount of cash tagged to be considered the 'same' transaction.
  5. Transactions should be binned by days i.e. any pending batches should not be considered as continuity for the next day.
  6. Cancelled transactions with amount in power of tens i.e. 10, 1000, 10000, should be ignored.

The query should retain all batches which meet the above criteria. The final table should have a column batch with the running total of the batches to differentiate them.

Initial Table:

shop amount status    date    
------------------------------
A    1234   Cancelled 20101010
A    1234   Cancelled 20101010
A    1234   Completed 20101010
A    1234   Cancelled 20101010
A    1234   Completed 20101011
A    1000   Completed 20101011
B    100    Cancelled 20101011
B    100    Cancelled 20101011
B    4321   Cancelled 20101011
B    4321   Cancelled 20101011
C    333    Cancelled 20101012
C    333    Completed 20101012
C    333    Completed 20101012
D    111    Cancelled 20101013
D    155    Cancelled 20101013
D    111    Completed 20101013
D    155    Completed 20101013

Delineated By Days:

shop amount status    date    
------------------------------
A    1234   Cancelled 20101010
A    1234   Cancelled 20101010
A    1234   Completed 20101010
A    1234   Cancelled 20101010
------------------------------
A    1234   Completed 20101011
A    1000   Completed 20101011
B    100    Cancelled 20101011
B    100    Cancelled 20101011
B    4321   Cancelled 20101011
B    4321   Cancelled 20101011
------------------------------
C    333    Cancelled 20101012
C    333    Completed 20101012
C    333    Completed 20101012
------------------------------
D    111    Cancelled 20101013
D    155    Cancelled 20101013
D    111    Completed 20101013
D    155    Completed 20101013

Resultant Table :

shop amount status    date     batch
-------------------------------------
A    1234   Cancelled 20101010   1
A    1234   Cancelled 20101010   1
A    1234   Completed 20101010   1
-------------------------------------
A    1234   Completed 20101011   2
A    1000   Completed 20101011   3
-------------------------------------
C    333    Cancelled 20101012   4
C    333    Completed 20101012   4
C    333    Completed 20101012   5
-------------------------------------
D    111    Cancelled 20101013   6
D    155    Cancelled 20101013   7
D    111    Completed 20101013   6
D    155    Completed 20101013   7

Table Query:

([] shop:`A`A`A`A`A`A`B`B`B`B`C`C`C`D`D`D`D; amount: 1234 1234 1234 1234 1234 1000 100 100 4321 4321 333 333 333 111 155 111 155; status:`Cancelled`Cancelled`Completed`Cancelled`Completed`Completed`Cancelled`Cancelled`Cancelled`Cancelled`Cancelled`Completed`Completed`Cancelled`Cancelled`Completed`Completed; date: `20101010`20101010`20101010`20101010`20101011`20101011`20101011`20101011`20101011`20101011`20101012`20101012`20101012`20101013`20101013`20101013`20101013)

Explanation:

  1. On the first day, A makes 4 transactions. The first three are batched together as they have the same amount [cancelled -> cancelled -> completed]. The last one transaction is ignored as it is the end of day.

  2. On the second day, A makes a transaction of the same amount of 1234 but it does not take the previous day transaction as part of its batch. A completes another transaction of 1000. B make four transaction but they not tracked as they are a) cancelled or b) powers of ten.

  3. On the third day, C makes three transaction of the same amount. This is considered two batches since the first cancellation and completion forms the initial batch, and the the final completed transaction is a batch by its own.

  4. On the fourth day, D makes four transactions and form two batches. Note that the transaction are not contiguous here since there are two cancelled transaction with varying amount, but both are completed in the future.

Table is ordered by timestamp and date i.e. 23:59:59 to 00:00:00. Query need not be a one-liner and can be a multi line query writing to any temp table/variable etc.

Additionally, if there's a way to get number of cancelled transaction per batch that will be helpful.


Solution

  • So first count the number of batches that are completed.

    q)n:count select from tab where status=`Completed
    

    Then use the below query to assign the batch numbers to each Completed row

    q)btab:update batch:1+til n from tab where status=`Completed
    q)btab
    shop amount status    date     batch
    ------------------------------------
    A    1234   Cancelled 20101010
    A    1234   Cancelled 20101010
    A    1234   Completed 20101010 1
    A    1234   Cancelled 20101010
    A    1234   Completed 20101011 2
    A    1000   Completed 20101011 3
    B    100    Cancelled 20101011
    B    100    Cancelled 20101011
    B    4321   Cancelled 20101011
    B    4321   Cancelled 20101011
    C    333    Cancelled 20101012
    C    333    Completed 20101012 4
    C    333    Completed 20101012 5
    D    111    Cancelled 20101013
    D    155    Cancelled 20101013
    D    111    Completed 20101013 6
    D    155    Completed 20101013 7
    

    Then reverse the table to fill forwards the nulls by date,shop and amount and reverse back and remove any Cancellations that are powers of 10 (using same logic as terrylynch)

    q)ftab:reverse update fills batch by date,shop,amount from reverse btab where not (status=`Cancelled)&{x=`int$x}10 xlog amount
    q)ftab
    shop amount status    date     batch
    ------------------------------------
    A    1234   Cancelled 20101010 1
    A    1234   Cancelled 20101010 1
    A    1234   Completed 20101010 1
    A    1234   Cancelled 20101010
    A    1234   Completed 20101011 2
    A    1000   Completed 20101011 3
    B    100    Cancelled 20101011
    B    100    Cancelled 20101011
    B    4321   Cancelled 20101011
    B    4321   Cancelled 20101011
    C    333    Cancelled 20101012 4
    C    333    Completed 20101012 4
    C    333    Completed 20101012 5
    D    111    Cancelled 20101013 6
    D    155    Cancelled 20101013 7
    D    111    Completed 20101013 6
    D    155    Completed 20101013 7
    

    And then select from the table and pull data that has batch numbers

    q)stab:select from ftab where batch<>0N
    q)stab
    shop amount status    date     batch
    ------------------------------------
    A    1234   Cancelled 20101010 1
    A    1234   Cancelled 20101010 1
    A    1234   Completed 20101010 1
    A    1234   Completed 20101011 2
    A    1000   Completed 20101011 3
    C    333    Cancelled 20101012 4
    C    333    Completed 20101012 4
    C    333    Completed 20101012 5
    D    111    Cancelled 20101013 6
    D    155    Cancelled 20101013 7
    D    111    Completed 20101013 6
    D    155    Completed 20101013 7
    q)
    

    Finally here is a query to get the number of cancellations per batch

    q)select numberOfCancellations:-1+count batch by batch from stab
    batch| numberOfCancellations
    -----| ---------------------
    1    | 2
    2    | 0
    3    | 0
    4    | 1
    5    | 0
    6    | 1
    7    | 1