Search code examples
nullkdb+q

Remove rows containing null values efficiently


What is the most efficient way to remove rows containing any null values, for a table with many rows?

Additionally, why do I get the error

ERROR: 'time (attempt to use variable time without defining/assigning first

even though I have nothing named as ‘time’ when using the functional select

?[t;{(not;(null;x))} each cols t; 0b; ()]

given in this S/Oanswer?


Solution

  • Is your table in memory or splayed on disk?

    If your table is in memory, then a simple form might also be the fastest.

    show t:flip`a`b`c`d`e!(1+til 100)*/:100 cut 500?(1 0N)where 19 1
    a  b  c  d  e
    --------------
    1  1     1  1
    2  2  2  2  2
    3  3  3  3  3
    4  4  4  4  4
    5  5  5  5  5
    6  6  6  6  6
    7  7  7  7  7
    8  8  8  8  8
    9  9  9  9  9
    10 10 10 10 10
    11 11 11 11 11
    12 12 12 12 12
    13    13 13 13
    14 14 14 14 14
    15 15 15 15 15
    16 16    16 16
       17 17 17 17
    18 18 18 18 18
    ..
    
    q)\ts:1000 ?[t;{(not;(null;x))}each cols t;0b;()]
    5 7552
    q)\ts:1000 t where not any null flip t
    2 6768
    

    If the table is splayed on disk, the game changes. In a splayed table each column is stored as a file. The second argument of the Select operator is a list of constraints; each one a parse tree. Whereas null flip t above tests every cell of t, Select applies is constraints successively: only rows that have passed previous constraints are tested for the next. If no rows satisfy the first constraint/s subsequent column files may never need to be read at all.

    From this follows the rule: list your constraints in descending order of restrictiveness – if nulls are more common in certain columns, test them first.

    Regarding an in-memory test, note that not any can be a good deal faster than all not, depending on whether nulls are common.

    q)q:null flip t
    q)\ts:10000 all not q
    9 1168
    q)\ts:10000 not any q
    6 656