Search code examples
algorithmtime-complexitypattern-matchingwildcardcomplexity-theory

Performance and Algorithms for Pattern Matching, With Wildcards


I'm trying to determine the best way to implement a pattern matching algorithm for a particular use case. I'll give a bit of an abstraction of what I'm trying to do.

I would like my algorithm to be able to search a sequence of characters, and find the first match to a particular pattern. There are a few subtleties. Firstly, the characters may occur in any order. Additionaly, the pattern may contain wildcards, and there may be multiple types of wildcards.

For example, suppose we have a pattern ?A ?G ?G, where the ? denotes that the proceeding character is a wildcard. Thus, the pattern is looking for the first match for any character and two repeated characters, occuring in any order. We could then have a sequence "ABCDEFF". Doing a depth first search as a tree, the algorithm would find the match AFF. Note that for a sequence FABCDEF, we would also find the match AFF, as the order isn't important.

I've estimated that this would have a runtime of about O(n m!), where n is the pattern length and m is the sequence length. Obviously, this isn't ideal for searching a long sequence. I was just wondering if there was a clever or best way of implementing this, or if there are any sort of mimoisation techniques etc that would increase the speed of the algorithm.


Solution

  • First, there is no truly efficient solution to your problem.

    We can demonstrate this by reducing the 3-partition problem to it. Given a multiset S of n = 3 *m  positive integers summing to n*T we can have n different letters, each occurring T times. Your pattern will contain 3*m wildcards, each repeated as many times as the corresponding integer in S. There is a way to match the pattern to the string if and only if there is a 3-partition of S. And therefore a solver for your problem can be used to solve the 3-partition problem. That makes it NP-hard. But solutions for your problem that say which letters match to which wildcards can be verified in polynomial time. Since verification is in P, your problem is actually NP-complete.

    In practice, a greedy solution may be good enough. Create a histogram for the string. Create a histogram for the pattern. Try matching the largest bucket for the histogram to the smallest bucket in the string that can match it. Keep going until you succeed or fail.

    If you want better, I'd suggest following the suggestion in the comments to use an integer solver for the set cover problem.