Search code examples
hashlualua-table

Lua: is there a need to use hash of string as a key in lua tables


I want to keep a list of words (let's say 20k). The only purpose of it is to check if the given word exists there. So, it's going be something known as a hash set.

-- dict here is just an example, it's going to be loadaed from a text file
dict = {["foo"] = true, ["bar"] = true, ...}


function dict.contains(self, word) do
    return self[word]
end

Is this option good enough? Or maybe this one would be better:

-- hash(string) is an exisiting external function

dict = {hash("foo") = true, hash("bar") = true, ... up to 20k}

function dict.contains(self, word) do
    return self[hash(word)]
end

I am inclining to vote for the 2nd option but thinking of internal implementation of lua hashes makes me a little bit confused. If it is just an address of an interned string, probably the 1st option is quite good one (in terms of memory consumption and better performance)? But from the other side, it should be good for statically defined tables (with interned strings). While I am going to load keys, maybe the 2nd options is still better?


Solution

  • Hash sets in Lua

    Yes, a table with string keys and truthy values (usually true) is the idiomatic, and most likely most efficient, way to implement a hash set of strings (or numbers, or "pointers" (like functions or tables, which are compared by reference)) in Lua.

    Method - data name collisions

    This however:

    function dict.contains(self, word) do
        return self[word]
    end
    

    is a bad idea, because suddenly dict:contains("contains") will be truthy (it will be the contains function).

    Similarly, dict.word will suddenly be truthy as well, even though you probably want all accesses to go through contains.

    I'd recommend just directly doing something like if words[word] then rather than unnecessarily complicating this by forcing it into an OOP style when it is not needed or useful. If you do want this, encapsulate the internal "set" of words, so store it in a field like dict._words to not have the methods conflict with the actual data you're storing.


    Premature optimization concerns

    Is this option good enough?

    It most probably is - try it and see. "20k" is a very small number of words for Lua to handle. I see no reason why the first solution shouldn't be fully satisfying.

    In asymptotical terms, accessing this table - both to insert new strings, or to get strings - has expected constant time complexity.

    Using the optimized hash table implementation provided by your Lua implementation is most probably a better idea than rolling your own.

    Or maybe this one would be better: [...] I am inclining to vote for the 2nd option

    So far you have provided no reason why it should be better, and barring exceptional circumstances, I don't think it will be. In many aspects, it's very blatantly worse:

    • It's more complex. This complexity would need to be justified.
    • It adds overhead. You need your own hashing of strings; you'll be duplicating the work Lua is already doing when it interns strings itself.
    • A hashing algorithm written in Lua is very unlikely to outperform the optimized hashing algorithm present in your Lua implementation, especially if you're using a well-optimized one like LuaJIT.

    If it is just an address of an interned string, probably the 1st option is quite good one (in terms of memory consumption and better performance)?

    Yes, on Lua 5.2 and earlier, all strings are interned and there is no way to turn this off. What this practically means is that different strings have different addresses, same strings reuse the same memory. String comparison is a simple address comparison and thus constant time. This is what enables table access to always be O(1) instead of O(length of string).

    So indeed, on Lua 5.2 and earlier, the first option is most certainly better, and by a lot.

    On Lua 5.3 and later, things get a bit more complicated: Very "long" strings (think thousands of characters) are not interned anymore, but instead are hashed on-demand, meaning comparison for these long strings is O(n), as is table access.

    Now there are two "exceptional circumstances" where the custom hashing could be justified, but I consider it very unlikely that you find yourself in either one:

    1. You're using an older Lua / LuaJIT implementation which does not protect against hash collision attacks, and need to protect against such attacks, which can make table access O(n) and thus lead to denial of service. In this case, custom hashing can help - as does just "salting" the keys with a salt unknown to the attacker.
    2. With very efficient custom hashing (think implementing a very good hashing algorithm in C, which somehow beats Lua's general-purpose string hashing for your use case), you could potentially beat Lua's built-in hashing. But at that point, you should probably be writing this performance-critical code in C or another language suitable for maximizing performance to begin with.

    But from the other side, it should be good for statically defined tables (with interned strings).

    If anything, this is another plus point for using Lua's implementation.

    While I am going to load keys, maybe the 2nd options is still better?

    No.


    TL;DR: Don't optimize prematurely; you'd most likely end up with more complex, less efficient code. Simply use a table with string keys.