c++, pattern-matching, trie, suffix-tree, suffix-array

Efficient String/Pattern Matching in C++ (suffix array, trie, suffix tree?)


I'm looking for an efficient data structure to do string/pattern matching on a really huge set of strings. I've found out about tries, suffix trees and suffix arrays, but I couldn't find a ready-to-use implementation in C/C++ so far (and implementing one myself seems difficult and error-prone). I'm also still not sure whether suffix arrays are really what I'm looking for... I've tried libdivsufsort and esaxx, but couldn't figure out how to use them for my needs:

I want to match user input against a predefined set of strings that contain wildcards (or even regular expressions). I have a huge list of predefined strings, e.g.

"WHAT IS *?" "WHAT IS XYZ?" "HOW MUCH *?" ...

Now I want to find the best matching string (if there is one that matches at all). E.g. the user input "WHAT IS XYZ?" should find "WHAT IS XYZ?" rather than "WHAT IS *?", but "WHAT IS SOMETHING?" should find "WHAT IS *?" (assuming * is a wildcard for any number of characters).

Building the structure isn't time critical (and the structure doesn't have to be super space efficient), but the search shouldn't take too long. How can this be done easily? Any framework/library or code example is welcome.

Thanks


Solution

  • Here is a solution that, I believe, should work well if you have a very large number of patterns. For just 10k it may be overkill, and implementing it means a fair amount of work, but you may be interested nevertheless.

    The basic idea is to create an inverted index that maps substrings of the patterns to pattern IDs. First, each pattern gets an ID:

    1: what is *
    2: where is *
    3: do * need to
    etc.
    

    And then we create an inverted index. In the simplest case, we split the patterns into tokens and map each token to the list of pattern IDs it occurs in. We can be flexible in what we define as a token, but one method is to assume that every white-space separated word is one token. So here is the index:

    what  -> 1
    is    -> 1,2
    where -> 2
    do    -> 3
    need  -> 3
    to    -> 3
    
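    Here is a minimal C++ sketch of such an index build (names like build_index and PatternId are purely illustrative, not from any particular library): each pattern is split on whitespace, and every token records the IDs of the patterns it occurs in.

    #include <cstddef>
    #include <sstream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    using PatternId = std::size_t;
    using InvertedIndex = std::unordered_map<std::string, std::vector<PatternId>>;

    // Build a token -> pattern-ID index; tokens are whitespace-separated words.
    InvertedIndex build_index(const std::vector<std::string>& patterns) {
        InvertedIndex index;
        for (PatternId id = 0; id < patterns.size(); ++id) {
            std::istringstream in(patterns[id]);
            std::string token;
            while (in >> token)
                index[token].push_back(id);
        }
        return index;
    }
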

    Then, when you get an input string from the user, you split that into tokens and look them up in the index. You combine all pattern IDs you get from the index. Example:

    INPUT: what is something?
    
    TOKENS:
       what      -> 1
       is        -> 1,2
       something -> n/a
    

    You retrieve the pattern IDs for each token and put them into a temporary data structure that counts the frequency of each ID, for example a hash map (e.g. a std::unordered_map<id_type, std::size_t>).

    You then sort this by frequency to find out that rule 1 was found twice and rule 2 was found once.
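    Continuing the sketch above (it reuses PatternId and InvertedIndex from there), the lookup and frequency counting could look roughly like this:

    #include <algorithm>
    #include <utility>

    // Return candidate pattern IDs, the ones sharing the most tokens with the input first.
    std::vector<PatternId> candidates(const InvertedIndex& index, const std::string& input) {
        std::unordered_map<PatternId, std::size_t> freq;
        std::istringstream in(input);
        std::string token;
        while (in >> token) {
            auto it = index.find(token);
            if (it == index.end()) continue;            // token occurs in no pattern
            for (PatternId id : it->second) ++freq[id];
        }
        std::vector<std::pair<PatternId, std::size_t>> ranked(freq.begin(), freq.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });
        std::vector<PatternId> ids;
        for (const auto& p : ranked) ids.push_back(p.first);
        return ids;
    }
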

    You then apply the rules you found, in the order of frequency, to the input text. Here you use a regular expression library or something similar to generate matches. The most frequent rule has the most tokens in common with the input text, so it is likely to match well.
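    If the rules only use * as a wildcard, one possible (again purely illustrative) way to apply a candidate rule is to translate it into a std::regex on the fly, escaping all other regex metacharacters:

    #include <regex>
    #include <string>

    // Translate e.g. "WHAT IS *?" into the regex "WHAT IS .*\?" and test the whole input against it.
    bool rule_matches(const std::string& rule, const std::string& input) {
        static const std::string special = R"(\^$.|?*+()[]{})";
        std::string re;
        for (char c : rule) {
            if (c == '*')                                    re += ".*";  // wildcard
            else if (special.find(c) != std::string::npos) { re += '\\'; re += c; }
            else                                             re += c;
        }
        return std::regex_match(input, std::regex(re));
    }
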

    The overall advantage of this approach is that you don't need to apply all the rules to the input, only those that have at least one token in common with it. Even among those, you proceed in the order of how many tokens each rule shares with the input, and once you have found a matching rule you can probably break off the rest of the matching procedure (or not, depending on whether you want all matching rules in each case or just one that is a very good match).

    Improvement: the above performs the rule preselection based on tokens. Instead, you could concatenate all the rules like this:

    what is *||where is *||do * need to||...
    

    Then you construct a suffix array of this concatenated string.

    Then, given an input string, you match it against the suffix array to identify all substring matches, including matches that are smaller than one token or that span multiple tokens. In the example above I assume that the wildcard symbol * and the separator || are included in the suffix array, although of course no part of an input string will ever match them. You can just as well exclude them from the suffix array or replace them with a dummy character.
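    For illustration, a simple (deliberately naive) way to build and query such a suffix array with nothing but the standard library is sketched below; for a really large concatenated string you would use a library such as libdivsufsort for the construction instead.

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Naive O(n^2 log n) suffix array construction; acceptable if building is not time critical.
    std::vector<std::size_t> build_suffix_array(const std::string& text) {
        std::vector<std::size_t> sa(text.size());
        for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = i;
        std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
            return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
        });
        return sa;
    }

    // All positions in 'text' where 'query' occurs, found by binary search over the suffix array.
    std::vector<std::size_t> find_occurrences(const std::string& text,
                                              const std::vector<std::size_t>& sa,
                                              const std::string& query) {
        auto lo = std::lower_bound(sa.begin(), sa.end(), query,
            [&](std::size_t pos, const std::string& q) { return text.compare(pos, q.size(), q) < 0; });
        auto hi = std::upper_bound(lo, sa.end(), query,
            [&](const std::string& q, std::size_t pos) { return text.compare(pos, q.size(), q) > 0; });
        return std::vector<std::size_t>(lo, hi);
    }
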

    Once you determine the matches, you sort them by length. You also need to map the match positions in the concatenated string back to rule IDs. This is readily possible by maintaining an array of the starting positions of the rules within the concatenated string; there are also highly optimised methods based on indexed bit vectors (I can elaborate on this if necessary).
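    A sketch of that position-to-rule mapping via an array of starting positions (assuming rule_starts is sorted ascending and its first entry is 0):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // rule_starts[i] = offset of rule i in the concatenated string, in ascending order.
    std::size_t rule_id_at(const std::vector<std::size_t>& rule_starts, std::size_t match_pos) {
        // The last start offset <= match_pos identifies the rule the match falls into.
        auto it = std::upper_bound(rule_starts.begin(), rule_starts.end(), match_pos);
        return static_cast<std::size_t>(it - rule_starts.begin()) - 1;
    }
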

    Once you have the matching rule IDs, you do the same as in the inverted index case: Apply the matching rules, using standard regex matching (or similar).

    Again, this approach is relatively complicated and makes sense only when you have a very large number of rules, and only if a token-based (or substring-based) lookup is likely to reduce the number of candidate rules significantly. From the example rules you gave, I assume the latter is the case, but whether the number of rules you are dealing with (on the order of 10k) justifies this approach, I am not sure. It may make more sense if the total number of rules is in the hundreds of thousands or millions.