search, data-structures, string-matching, string-search

Product name string matching against a trie (supporting omissions)


I have a list of CPU models. Right now, I think the most suitable approach would be forming a trie from the list, like this:

Intel -- Core -- i -- 3
      |       |    |- 5
      |       |    |- 7
      |       |    -- 9
      |       |
      |       -- 2 Duo
      |
      |- Xeon -- ...
      |
      |...

Now, I want to match an input string against this trie. This is easy for exact matching, but what if I need a fuzzy match, where the input sequence can have omissions? For example, "Intel i3", "Core i3" and "i3" should all match "Intel -> Core -> i -> 3" in the trie.

Is a trie the right data structure for this problem? I thought about using trie search with wildcards, but the wildcard here can be in any position in the sequence.

What data structure can I use to represent the list in a way most applicable to this problem? What algorithm do I use for search?


Solution

  • While I'm not sure it's the optimal data structure for the task, you could use an augmented trie where every node has direct links to every descendant. You will want lookups over those links to be better than a linear search (the trie root would have a link to every other node), and you also have to deal with duplicate values, but the memory costs should be fine as long as your depth is reasonable (which should be true for CPU models). This would look something like:

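    class TrieAugmented:
        """Trie node that also indexes the values of all its descendants."""

        def __init__(self, val: str):
            self.val = val
            # Direct children, as in an ordinary trie.
            self.children: list["TrieAugmented"] = []
            # Descendant value -> direct children that have that value
            # or contain it somewhere in their own subtree.
            self.child_paths: dict[str, list["TrieAugmented"]] = {}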
    

    When adding CPU models, new nodes are appended to the list of children as usual, but child_paths has to be updated on every ancestor node for each new node, so an addition costs O(d^2) rather than O(d), where d is the depth. I would have child_paths map each descendant value to the list of nodes in self.children that either have that value themselves or store it in their own child_paths. If you plan on building a static trie and then querying it, you can build the trie first, updating only direct children as usual, and then add all the shorter paths in a single depth-first pass through the trie. Each node is referenced by up to d ancestors instead of one, so with n nodes this takes O(n*d) space overall (O(n^2) in the worst case of a very deep trie) instead of linear, but that should be doable for a relatively small set of items.
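
As a rough illustration of the insertion and lookup described above, here is a minimal sketch. The names `Node`, `tokenize`, `insert` and `search` are made up for this example, and it assumes model names are split into runs of letters and runs of digits (so "i3" becomes "i", "3" and "2 Duo" becomes "2", "Duo"), with case-sensitive matching. Duplicates are handled by collecting results into a set.

```python
from __future__ import annotations

import re


class Node:
    """Augmented trie node; all names here are illustrative."""

    def __init__(self, val: str):
        self.val = val
        self.children: list[Node] = []
        # token -> direct children that have the token as their value
        # or contain it somewhere below them
        self.child_paths: dict[str, list[Node]] = {}


def tokenize(text: str) -> list[str]:
    # Assumption: letters and digits form separate tokens ("i3" -> "i", "3").
    return re.findall(r"[A-Za-z]+|\d+", text)


def insert(root: Node, tokens: list[str]) -> None:
    path = [root]
    for tok in tokens:
        parent = path[-1]
        child = next((c for c in parent.children if c.val == tok), None)
        if child is None:
            child = Node(tok)
            parent.children.append(child)
        path.append(child)
    # Register every node on the path with each of its ancestors: O(d^2).
    for i in range(1, len(path)):
        for j in range(i):
            bucket = path[j].child_paths.setdefault(path[i].val, [])
            if path[j + 1] not in bucket:
                bucket.append(path[j + 1])


def _find(node: Node, tokens: list[str], prefix: tuple[str, ...]):
    if not tokens:
        yield prefix
        return
    # Only descend into children known to contain the next token.
    for child in node.child_paths.get(tokens[0], []):
        here = prefix + (child.val,)
        if child.val == tokens[0]:
            yield from _find(child, tokens[1:], here)  # consume the token
        if tokens[0] in child.child_paths:
            yield from _find(child, tokens, here)  # token occurs deeper


def search(root: Node, query: str) -> list[tuple[str, ...]]:
    """Paths from the root to each node matching the last query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    return sorted(set(_find(root, tokens, ())))


root = Node("")
for model in ["Intel Core i3", "Intel Core i5", "Intel Core i7",
              "Intel Core i9", "Intel Core 2 Duo"]:
    insert(root, tokenize(model))
```

With this setup, `search(root, "Intel i3")`, `search(root, "Core i3")` and `search(root, "i3")` all return the single path `("Intel", "Core", "i", "3")`. The search only walks branches that are known to contain the next token, which is what the extra links buy you.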

    If storage and implementation complexity are bigger concerns than runtime, you can use an unaugmented trie. This makes the runtime linear in the number of trie nodes at best, rather than roughly linear in the input size, but the problem becomes very similar to matching file system paths with arbitrary nested structure. In Rust glob syntax, you would translate "Core i3" to "/**/Core/**/i/**/3" and treat your trie as a file system (you do in fact insert a wildcard at every position in the sequence, and each wildcard can match arbitrarily many levels of the trie). Here the trie doesn't make the lookup particularly fast, but it does make it possible to match models with omissions to their fully specified versions.
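
The glob-style idea can be sketched on a plain trie as well. This is only an illustration under assumptions: `TrieNode`, `to_glob`, `add_model` and `glob_search` are hypothetical names, the tokenizer splits letters from digits, and the matcher is a hand-written depth-first search that mimics `**` rather than a call into any real glob library.

```python
from __future__ import annotations

import re


class TrieNode:
    """Ordinary trie node with no extra links (illustrative)."""

    def __init__(self, val: str):
        self.val = val
        self.children: dict[str, TrieNode] = {}


def tokenize(text: str) -> list[str]:
    # Assumption: letters and digits form separate tokens ("i3" -> "i", "3").
    return re.findall(r"[A-Za-z]+|\d+", text)


def to_glob(query: str) -> str:
    # Shows the pattern the DFS below emulates: a "**" before every token.
    return "/**/" + "/**/".join(tokenize(query))


def add_model(root: TrieNode, model: str) -> None:
    node = root
    for tok in tokenize(model):
        if tok not in node.children:
            node.children[tok] = TrieNode(tok)
        node = node.children[tok]


def _match(node: TrieNode, tokens: list[str], prefix: tuple[str, ...]):
    if not tokens:
        yield prefix
        return
    for child in node.children.values():
        here = prefix + (child.val,)
        if child.val == tokens[0]:
            yield from _match(child, tokens[1:], here)  # consume the token
        yield from _match(child, tokens, here)  # "**" absorbs this level


def glob_search(root: TrieNode, query: str) -> list[tuple[str, ...]]:
    """Paths from the root to each node matching the last query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # A set removes paths reached through more than one alignment.
    return sorted(set(_match(root, tokens, ())))


plain_root = TrieNode("")
for model in ["Intel Core i3", "Intel Core i5", "Intel Core i7",
              "Intel Core i9", "Intel Core 2 Duo"]:
    add_model(plain_root, model)
```

Here `to_glob("Core i3")` gives `"/**/Core/**/i/**/3"`, and `glob_search(plain_root, "Core i3")` returns the path `("Intel", "Core", "i", "3")`. Unlike the augmented version, this search may visit every node of the trie, since nothing tells it which subtrees contain the next token.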