In the Q for Mortals chapter on data normalisation, i.e. the task of eliminating duplication in a list, the author recommends using enumerations for finding distinct values in a list, as it's faster to traverse integers than symbols of variable length.
u:`g`ibm`intl`msft / unique list of tickers
v:1000000?u / list with duplicate tickers
k:u?v / positions in u
\t:10 distinct v / performing distinct on symbols 10 times and timing
\t:10 distinct k / performing distinct on positions 10 times and timing
I find that distinct v is much faster than distinct k, which is not in line with what was promised.
Thanks for the help.
Enumeration is usually used for data saved to disk, to aid with compression etc. That's where you will see the bigger performance gain.
KDB+ 3.5 2017.04.06 Copyright (C) 1993-2017 Kx Systems
Welcome to kdb+ 32bit edition
For support please see http://groups.google.com/d/forum/personal-kdbplus
Tutorials can be found at http://code.kx.com/wiki/Tutorials
To exit, type \\
To remove this startup msg, edit q.q
q)u:`g`ibm`intl`msft / unique list of tickers
q)v:1000000?u / list with duplicate tickers
q)k:`u$v //enumerate v against u
q)k
`u$`g`g`intl`ibm`intl`ibm`intl`msft`intl`ibm`g`msft`ibm`intl`intl`ibm`g`ibm`i..
q)save `:k
`:k
q)save `:u
`:u
q)save `:v
`:v
q)\\
q)u:get `:u
q)\ts:10 distinct get `:v
462 8388848
q)\ts:10 distinct get `:k
37 4194544
q)
But you do raise an interesting question as to why distinct is faster on a list of symbols (in memory) than on a list of ints.
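To look at the in-memory case side by side, here is a minimal sketch that builds all three representations in one session and times distinct on each. It reuses u and v from above; the name e for the enumerated list and the two sanity checks are my own additions for illustration. I have left the timings out since they will vary by version and machine.

```q
u:`g`ibm`intl`msft      / unique list of tickers
v:1000000?u             / symbols, with duplicates
k:u?v                   / plain integer positions of v in u
e:`u$v                  / v enumerated against u

/ sanity checks: both forms map back to the original symbols
v~u k                   / index u with the positions
v~value e               / de-enumerate

/ time distinct on each representation, 10 runs each
\ts:10 distinct v
\ts:10 distinct k
\ts:10 distinct e
```

Comparing the three \ts lines should show whether the gap you are seeing is specific to plain ints or also applies to the enumerated form.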