Search code examples
group-bykdb

kdb/q question: How do I interpret this groupby in my functional selection?


I am new to kdb/q and am trying to figure out what this particular query means. The code is using functional select, which I am not overly comfortable with.

?[output;();b;a];

where output is some table which has columns size time symbol

the groupby filter dictionary b is defined as follows

key | value
---------------
ts  | ("+";00:05:00v;("k){x*y div x:$[16h=abs[@x];"j"$x;x]}";00:05:00v;("%:";`time)))
sym | ("k){x'y}";"{`$(,/)("/" vs string x)}";`symbol)

For the sake of completeness, dictionary a is defined as

volume  ("sum";`size)  

In effect, the functional select seems to be bucketing the data into 5 minute buckets and doing some parsing in symbol. What baffles me is how to read the groupby dictionary. Especially the k)" part and the entire thing being in quotes. Can someone help me go through this or point me to resources that can help me understand? Any input will be appreciated.


Solution

  • The aggregation part of the function form takes a dictionary, the key being the output key column names and the values being parse tree functions.

    A parse tree is an expression that is not immediately evaluated. The first argument as a function and subsequent elements are its arguments. The inner-most brackets are evaluated first and then it moves up the heirarchy, evaluating each one in turn. More detailed information can be found here and in the whitepaper linked on that page

    You can use the function parse with a string argument to get the parse tree of a function. For example, the parse tree for 1+2+3 is (+;1;(+;2;3)):

    q)parse "1+2+3"
    +
    1
    (+;2;3)
    

    The inner-most bracket (+;2;3) is evaluated first resulting in 5, before the result is propogated up to the outmost parse tree function (+;1;5) giving 6

    The groupby part of the clause will evaluate one or more parse tree functions and then will collect together records with the same output from the grouping function.

    Making the function a bit clearer to read:

    (+;00:05:00v;({x*y div x:$[16h=abs[@x];"j"$x;x]}";00:05:00v;(%:;`time)))
    

    Looking at the inner most bracket (%:;`time), it returns the result of %: applied on the time column. We can see that %: is k for the function ltime

    q)ltime
    %:
    

    Moving up a level, the next function evaluated is the lambda function {x*y div x:$[16h=abs[@x];"j"$x;x]} with arguments 00:05:00v and the result of our previous evaluated function. The lambda rounds it down the the nearest 5 minute interval

    ({x*y div x:$[16h=abs[@x];"j"$x;x]};00:05:00v;(%:;`time))
    

    Moving up once more to the whole expression it is equivalent to 00:05:00v + {x*y div x:$[16h=abs[@x];"j"$x;x]};00:05:00v;(%:;`time)), with 00:05:00 being added onto each result from the previous evaluation.

    So essentially it first returns the local time of the timestamp, then

    For the symbol aggregation

    ("k""{x'y}";{`$(,/)("/" vs string x)};`symbol)
    

    The inner function {`$(,/)("/" vs string x)} strings a symbol, splits it at "/" character and then joins it back together, effectively removing the slash

    The "k" is a function that evaluates the string using the k interpreter.

    "k""{x'y}"" returns a function which itself takes a function x and argument y and modifies the function to use the each-both adverb '. This makes it so that the function x is applied on each symbol individually as opposed to the column as a whole.

    This could be implemented in q instead of k like so:

    ({x@'y};{`$(,/)("/" vs string x)};`symbol)
    

    The function {x@'y} takes the function argument {`$(,/)("/" vs string x)} and the symbol column as before, but we have to use @ with the each-both adverb in q to apply the function on the arguments.

    The aggregation function will then be applied to each group. In your case the function is a simple parse tree, which will return the sum of the size columns in each group, with the output column called volume

    a:enlist[`volume]!enlist (sum;`size)