Search code examples
kdb

How serializing foreign keyed table works internally in kdb


I have a keyed table(referenced table) linked using foreign key to the referencing table and I serialize both tables using set operator.

q)kt:([sym:`GOOG`AMZN`FB]; px:20 30 40);
q)`:/Users/uts/db/kt set kt
q)t:([] sym:`kt$5?`GOOG`AMZN`FB; vol:5?10000)
q)`:/Users/uts/db/t set t

Then I remove these tables from the memory

q)delete kt,t from `.

Now I deserialize the table t in memory:

t:get `:/Users/uts/db/t

If I do meta t after this it fails, expecting kt as foreign key. If I print t, as expected it shows index values in column sym of table t.

So, the question arises -

  1. As kdb stores the meta of each table(i.e c,t,f,a) and its corresponding values on disk, how does table t serialization works internally?

  2. How(In which form in binary format) are these values stored in file t.

    -rw-r--r-- 1 uts staff 100 Apr 13 23:09 t


Solution

  • tl;dr A foreign key is stored as a vector of 4-byte indices of a key column of a referenced table plus a name of a table a foreign key refers to.

    As far as I know kx never documented their file formats, and yet I think some useful information relevant to your question can be deduced right from a q console session.

    Let me modify your example a bit to make things simpler.

    q)show kt:([sym:`GOOG`AMZN`FB]; px:20 30 40)
    sym | px
    ----| --
    GOOG| 20
    AMZN| 30
    FB  | 40
    q)show t:([] sym:`kt$`GOOG`GOOG`AMZN`FB`FB)
    sym
    ----
    GOOG
    GOOG
    AMZN
    FB
    FB
    

    I left only one column - sym - in t because vol is not relevant to the question. Let's save t without any data first:

    q)`:/tmp/t set 0#t
    `:/tmp/t
    q)hcount `:/tmp/t
    30
    

    Now we know that it takes 30 bytes to represent t when it's empty. Let's see if there's a pattern when we start adding rows to t:

    q){`:/tmp/t set x#t;`cnt`size!(x;hcount[`:/tmp/t] - 30)} each til[11], 100 1000 1000000
    cnt     size
    ---------------
    0       0
    1       4
    2       8
    3       12
    4       16
    5       20
    6       24
    7       28
    8       32
    9       36
    10      40
    100     400
    1000    4000
    1000000 4000000
    

    We can see that adding one row increases the size of t by four bytes. What can these 4 bytes be? Can they be a representation of a symbol itself? No, because if they were and we renamed a sym value in kt it would affect the size of t on disk but it doesn't:

    q)update sym:`$50#.Q.a from `kt where sym=`GOOG
    `kt
    q)1#t
    sym
    --------------------------------------------------
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx
    q)`:/tmp/t set 1#t
    `:/tmp/t
    q)hcount `:/tmp/t
    34
    

    Still 34 bytes. I think it should be obvious by now that the 4 bytes is an index, but an index of what? Is it an index of a column which must be called sym exactly? Apparently no, it isn't.

    q)kt:`foo xcol kt
    q)t
    sym
    --------------------------------------------------
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx
    AMZN
    FB
    FB
    

    There's no column called sym in kt any longer but t hasn't changed at all! We can go even further and change the type of foo (ex sym) in kt:

    q)update foo:-1 -2 -3.0 from `kt
    `kt
    q)t
    sym
    ---
    -1
    -1
    -2
    -3
    -3
    

    Not only did it change t, it changed its meta too:

    q)meta t
    c  | t f  a
    ---| ------
    sym| f kt
    q)/  ^------- used to be s
    

    I hope it's clear now that kdb stores a 4-byte index of a key column of a referenced table and a name of a table (but not a key column name!). If a referenced table is missing kdb can't reconstruct the original data and displays the bare index. It a referencing table needs to be sent over the wire then indices are replaced with actual values so that the receiving side can see the real data.