Search code examples
kdbq-lang

Trying to aggregate by mean and then trim outliers of observations based on the mean in kdb


I'm trying to find the average value by item type, and then set a variable y that takes the value:

  1. missing value if x = 0,
  2. y = x. Then round values that are too big or too small:
  3. if x > 0, x > mu, then set y = mu
  4. if x < 0, x < mu, then set y = mu

This is what I've tried, which doesn't produce the desired result below:

 tab2:([]items:`a`b`a`a`b; x:-6 8 0 -3 5)
 tabsum: select mu:avg x by items from tab2;
 tab2: update y:x from tab2
 tab2: update y:mu from tab2 where x > 0 and x > mu / get error after running above step
 tab2: update y:mu from tab2 where x < 0 and x <= mu

Desired result:

 items  x   mu    y
     a -6 -3.0 -3.0
     b  8  6.5  6.5
     a  0 -3.0  NaN
     a -3 -3.0 -3.0
     b  5  6.5  5

Related to this data:

i) Is NaN the appropriate type for missing values in kdb? (NA is different to NaN or NULL in R for example). I'm guessing so based on what I've read so far.

ii) Is there more efficient code to get the mu column inside tab2? Making another table and merging I'm guessing isn't efficient (still learning basics of kdb)

iii) If I just run

tab2:([]items:`a`b`a`a`b; x:-6 8 0 -3 5)
     tabsum: select mu:avg x by items from tab2;
     tab2: update y:mu from tab2 where x > 0 and x > mu

I get:

  items  x   mu   y
1     a -6 -3.0 NaN
2     b  8  6.5 6.5
3     a  0 -3.0 NaN
4     a -3 -3.0 NaN
5     b  5  6.5 6.5

Line 5 doesn't make sense to me. Why is y = 6.5 if x < mu? I expect y =5 for that row. Clearly my understanding of what's going on is wrong.

iv) How can I get the desired result (the code I've posted doesn't work properly)


Solution

  • This might not be the prettiest solution, but seems to fit your criteria. I'm using a vector conditional ? to set the value of y:

    q)show tab3: update y:?[((x>0) and x>mu) or ((x<0) and x<=mu);mu;x] from update mu:avg x by items from tab2
    items x  mu  y  
    ----------------
    a     -6 -3  -3 
    b     8  6.5 6.5
    a     0  -3  0  
    a     -3 -3  -3 
    b     5  6.5 5
    

    And then replacing any y=0 values with null 0n (rather than NaN):

    q)update y:0n from tab3 where y=0
    items x  mu  y  
    ----------------
    a     -6 -3  -3 
    b     8  6.5 6.5
    a     0  -3     
    a     -3 -3  -3 
    b     5  6.5 5
    

    The unexpected result on your last line is down to order of execution; q code is executed right-to-left. So:

    5>0 and 5>6.5
    

    actually means:

    5>(0 and 5>6.5)
    

    which is:

    5>0
    

    which evaluates to true. To make the desired comparison, you need to use brackets:

    q)(5>0) and 5>6.5
    0b
    

    There's a few reasons why your block of code doesn't work. On line 2, you don't update tab2 with the mu column; you just create a new table. So consequently when you try to use mu on line 4, it doesn't exist (which will cause an error). On line 3 you set the values of y to be the same as x, but this also means the same type (integer). Then later when you try to reset y to a float value from mu you get a 'type error.

    Here's a block of code similar to yours that works:

    q)tab2:([]items:`a`b`a`a`b; x:-6 8 0 -3 5)
    q)tab2: update mu:avg x by items from tab2
    q)tab2: update y:"f"$x from tab2
    q)tab2: update y:mu from tab2 where (x>0) and (x> mu)
    q)tab2: update y:mu from tab2 where (x<0) and (x<= mu)
    q)tab2
    items x  mu  y  
    ----------------
    a     -6 -3  -3 
    b     8  6.5 6.5
    a     0  -3  0  
    a     -3 -3  -3 
    b     5  6.5 5