Search code examples
architecturesystemscalabilitydistributed-computingtrie

How to scale a trie across multiple servers


Does anyone know how I might scale a Trie across multiple machines? Say the first machine runs out of space and I need to add more words from a very large dictionary, what might I do to add more words? (I am a Java thinker, but I believe the answer can be language agnostic). I have already realized that I cannot just say one machine for each first character, but that doesn't really scale.


Solution

  • Ok, given the assumption, that both of your machines have the same resources available, let’s first look at a simpler example:

    how would you scale a binary tree? Or even better - an AVL tree? There are several examples to do this:

    1. If there would be only 2 machines and storage is your problem, I’d keep the root and the left subtree on one machine, and send the right subtree to the other.
    2. If you’d have 3 machines and would also want to have a load balancer, the root would stay on one machine and the left and right subtree would be split accross the other 2 machines. If you have 5 you keep the root and first level of children on the load balancer and split the rest of the tree.

    (note that balancing such a distributed tree will be much more complicated, as you’ll need to communicate with other machines and do it possibly inside a distributed transaction, to be able to answer all requests concurrently)

    So, now a trie, which - AFAIR - is a tree / letter. If the letters in your words would be distributed evenly, you could have A-M on one machine and N-Z on the other. This will probably not work, but you’ll for sure be able to split it more or less 50/50 like this.

    If you now want to add more and more machines, I’d keep a main node which would work as a load balancer and distribute it to the child nodes, which would only take care of few letters. For instance you could have nodes

    • A-F
    • G-M
    • N-R
    • S
    • T-Z

    Assuming, you have roughly as much data for the letters A-F as you have for the letter S. (There actually might be a language, where this would be at least close to the most optimal distribution)

    Now if you get too many letters in A-F you can just split it into A-D and E-F for instance, nothing really changes there. The problem will be if you’d get too many letters in S. Now you’d have 3 possibilities:

    1. You make another load balancer for the letter S - this will be for sure easy, as you already have a load balancer implemented and you can use the same functionality on any level
    2. You keep letters SA-SM (for instance) in one node, which will be the main node, store SN-SZ on a separate node. So if you get SP.. the first load balancer would send it to your SA-SM node and that one would forward it to SN-SZ
    3. You modify the load root load balancer to be able to specify more complex boundaries between nodes, such as you’d have now the nodes

      • A-F
      • G-M
      • N-R
      • SA-SM
      • SN-SZ
      • T-Z

    Here number 1 is probably the easiest and cleanest solution, but might have some unused hardware. In case you can use different resources for nodes, option 1 with a small load balancer for the letter S would be probably the way to go. Option 2 is a dirty mix, and option 3 might be the nicest way to go, but it makes the load balancer potentially complicated and error prone.

    Hope this ideas help you.