Search code examples
algorithmsortingsearchgraph-algorithm

Best algorithm and Timespace complexity for querying namespace


Suppose I have a namespace that has 7 levels is of the format A/B/C/D/E/F/G = 50

This namespace currently used in a dictionary to manage the values of particular variables in the system. Naturally this represents a tree. This tree has over 20,000 leaf nodes and growing and requires updating 1,000s of times a second.

We need to support wildcards (+) that allows us to query this namespace. For example: return all namespaces that fit wildcard A/+/C/D/E+/G. or Get namespaces of A/B/C/+/+/G = 100.

Namespaces are not added dynamically.

Using regex I can replace the '+' with '(*.)' however it is O(n). I understand that is the case if it is +/+/+/+/+/+/+.

All I can think of is a standard tree with DFS? Is there a better approach to this? What would the amortized time/space complexity be?

Thank you


Solution

  • Without further assumptions, querying a tree amounts to O(n) since degenerate cases like lists are included.

    Balanced binary search trees guarantee O(log n) - the depths of any 2 root paths differ at most by 1 meaning that the tree's height is bounded by O(log n).

    Hashing root paths leads to O(1) with the usual caveats regarding hash functions.

    In the use case you've sketched, some optimizations might apply:

    • Wildcard resolution and caching:
      Depending on the frequency the tree structure changes and the number of effectively different wildcarded queries, resolving wildcarded queries into the set of matching rootpaths and caching them might be beneficial if it supports a rootpath based cache.

    • Joblist processing model Given the ratio of nodes and queries per time unit, split up the code into:

      1. Traversal threads These traverse the tree in predetermined ways. They have access to a joblist containing currently active queries. When the thread reaches a node that matches an active query, the it sends the data to an output thread.

      2. Dispatcher thread Receives queries, possibly bundling them, and dispatches them to one/several traversal threads

      3. Output thread Collects query results and outputs them.

    Amortized complexity cannot be determined without modelling the distribution and possibly the order of queries. If a locality property could be proven for queries, ie. if queries that arrive within small intervals of time tend to select the same subtrees / share large portions of the root path, collecting queries before answering them might reduce the amortized time significantly.

    In-the-large perspective

    The outlined requirements are typical for databases, and that's the domain where tons of research has produced sophisticated algorithms. So an obvious solution would be to harness a dbms and rely on the db optimizer to handle the query load efficiently.