I want to keep a list of words (let's say 20k). The only purpose of it is to check if the given word exists there. So, it's going be something known as a hash set.
-- dict here is just an example, it's going to be loadaed from a text file
dict = {["foo"] = true, ["bar"] = true, ...}
function dict.contains(self, word) do
return self[word]
end
Is this option good enough? Or maybe this one would be better:
-- hash(string) is an exisiting external function
dict = {hash("foo") = true, hash("bar") = true, ... up to 20k}
function dict.contains(self, word) do
return self[hash(word)]
end
I am inclining to vote for the 2nd option but thinking of internal implementation of lua hashes makes me a little bit confused. If it is just an address of an interned string, probably the 1st option is quite good one (in terms of memory consumption and better performance)? But from the other side, it should be good for statically defined tables (with interned strings). While I am going to load keys, maybe the 2nd options is still better?
Yes, a table with string keys and truthy values (usually true
) is the idiomatic, and most likely most efficient, way to implement a hash set of strings (or numbers, or "pointers" (like functions or tables, which are compared by reference)) in Lua.
This however:
function dict.contains(self, word) do return self[word] end
is a bad idea, because suddenly dict:contains("contains")
will be truthy (it will be the contains
function).
Similarly, dict.word
will suddenly be truthy as well, even though you probably want all accesses to go through contains
.
I'd recommend just directly doing something like if words[word] then
rather than unnecessarily complicating this by forcing it into an OOP style when it is not needed or useful. If you do want this, encapsulate the internal "set" of words, so store it in a field like dict._words
to not have the methods conflict with the actual data you're storing.
Is this option good enough?
It most probably is - try it and see. "20k" is a very small number of words for Lua to handle. I see no reason why the first solution shouldn't be fully satisfying.
In asymptotical terms, accessing this table - both to insert new strings, or to get strings - has expected constant time complexity.
Using the optimized hash table implementation provided by your Lua implementation is most probably a better idea than rolling your own.
Or maybe this one would be better: [...] I am inclining to vote for the 2nd option
So far you have provided no reason why it should be better, and barring exceptional circumstances, I don't think it will be. In many aspects, it's very blatantly worse:
If it is just an address of an interned string, probably the 1st option is quite good one (in terms of memory consumption and better performance)?
Yes, on Lua 5.2 and earlier, all strings are interned and there is no way to turn this off. What this practically means is that different strings have different addresses, same strings reuse the same memory. String comparison is a simple address comparison and thus constant time. This is what enables table access to always be O(1) instead of O(length of string).
So indeed, on Lua 5.2 and earlier, the first option is most certainly better, and by a lot.
On Lua 5.3 and later, things get a bit more complicated: Very "long" strings (think thousands of characters) are not interned anymore, but instead are hashed on-demand, meaning comparison for these long strings is O(n), as is table access.
Now there are two "exceptional circumstances" where the custom hashing could be justified, but I consider it very unlikely that you find yourself in either one:
But from the other side, it should be good for statically defined tables (with interned strings).
If anything, this is another plus point for using Lua's implementation.
While I am going to load keys, maybe the 2nd options is still better?
No.
TL;DR: Don't optimize prematurely; you'd most likely end up with more complex, less efficient code. Simply use a table with string keys.