Search code examples
regexalgorithmstring-matching

Algorithm to find out whether the matches for two Glob patterns (or Regular Expressions) intersect


I'm looking at matching glob-style patterns similar the what the Redis KEYS command accepts. Quoting:

  • h?llo matches hello, hallo and hxllo
  • h*llo matches hllo and heeeello
  • h[ae]llo matches hello and hallo, but not hillo

But I am not matching against a text string, but matching the pattern against another pattern with all operators being meaningful on both ends.

For example these patterns should match against each other in the same row:

prefix*       prefix:extended*
*suffix       *:extended:suffix
left*right    left*middle*right
a*b*c         a*b*d*b*c
hello*        *ok
pre[ab]fix*   pre[bc]fix*

And these should not match:

prefix*       wrong:prefix:*
*suffix       *suffix:wrong
left*right    right*middle*left
pre[ab]fix*   pre[xy]fix*
?*b*?         bcb

So I'm wondering ...

  • if this is possible to do (implement a verification algorithm), if at all?
  • if not possible, what subset of regex would be possible? (i.e. disallow * wildcard?)
  • if it is indeed possible, what is an efficient algorithm?
  • what are the time complexity required?

EDIT: Found this other question on RegEx subset but this is not exactly the same as the words that hello* and *ok matches is not a subset/superset of each other but they do intersect.

So I guess mathematically, this might be phrased as; is it possible to deterministically check that a set of words that one pattern match, intersecting with a set of words that another pattern matches, result in a non-empty set?


EDIT: A friend @neizod drew up this elimination table which neatly visualize what might be a potential/partial solution: Elimination rule


EDIT: Will adds extra bounty for those who can also provide working code (in any language) and test cases that proves it.


EDIT: Added the ?*b*? test case discovered by @DanielGimenez in the comments.


Solution

  • Now witness the firepower of this fully ARMED and OPERATIONAL battle station!

    (I have worked too much on this answer and my brain has broken; There should be a badge for that.)

    In order to determine if two patterns intersect, I have created a recursive backtracking parser -- when Kleene stars are encountered a new stack is created so that if it fails in the future everything is rolled back and and the star consumes the next character.

    You can view the history of this answer to determine how arrived at all this and why it was necessary, but basically it wasn't sufficient to determine an intersection by looking ahead only one token, which was what I was doing before.

    This was the case that broke the old answer [abcd]d => *d. The set matches the d after the star, so the left side would still have tokens remaining, while the right side would be complete. However, these patterns two intersect on ad, bd, cd and dd, so that needed to be fixed. My almost O(N) answer was thrown out.

    Lexer

    The lexing process is trivial, except that is processes escape characters and removes redundant stars. Tokens are broken out into sets, stars, wild character (?), and character. This is different than my previous versions where one token was a string of characters instead of a single character. As more cases come up, having strings as tokens was more of a hindrance than advantage.

    Parser

    Most of the functions of the parser are pretty trivial. A switch given the left side's type, calls a function that is a switch that determines the appropriate function to compare it with the right side's type. The result from the comparison bubbles up the two switches to the original callee, typically the main loop of the parser.

    Parsing Stars

    The simplicity ends with the star. When that is encountered it takes over everything. First it compares its side's next token with the other side's, advancing the other side until if finds a match.

    Once the match is found, it then checks if everything matches all the way up to the end of both patterns. If it does then the patterns intersect. Otherwise, it advances the other side's next token from the original one it was compared against and repeats the process.

    When two anys are encountered then the take off into their own alternative branches starting from each others' next token.

    function intersects(left, right) {
        var lt, rt,
            result = new CompareResult(null, null, true);
    
        lt = (!left || left instanceof Token) ? left : tokenize(left);
        rt = (!right || right instanceof Token) ? right : tokenize(right);
    
        while (result.isGood && (lt || rt)) {
            result = tokensCompare(lt, rt);
    
            lt = result.leftNext;
            rt = result.rightNext;
        }
    
        return result;
    }
    
    function tokensCompare(lt, rt) {
        if (!lt && rt) return tokensCompare(rt, lt).swapTokens();
    
        switch (lt.type) {
            case TokenType.Char: return charCompare(lt, rt);
            case TokenType.Single: return singleCompare(lt, rt);
            case TokenType.Set: return setCompare(lt, rt);
            case TokenType.AnyString: return anyCompare(lt, rt);
        }
    }
    
    function anyCompare(tAny, tOther) {
        if (!tOther) return new CompareResult(tAny.next, null);
    
        var result = CompareResult.BadResult;
    
        while (tOther && !result.isGood) {
            while (tOther && !result.isGood) {
                switch (tOther.type) {
                    case TokenType.Char: result = charCompare(tOther, tAny.next).swapTokens(); break;
                    case TokenType.Single: result = singleCompare(tOther, tAny.next).swapTokens(); break;
                    case TokenType.Set: result = setCompare(tOther, tAny.next).swapTokens(); break;
                    case TokenType.AnyString:
                        // the anyCompare from the intersects will take over the processing.
                        result = intersects(tAny, tOther.next);
                        if (result.isGood) return result;
                        return intersects(tOther, tAny.next).swapTokens();
                }
    
                if (!result.isGood) tOther = tOther.next;
            }
    
            if (result.isGood) {
                // we've found a starting point, but now we want to make sure this will always work.
                result = intersects(result.leftNext, result.rightNext);
                if (!result.isGood) tOther = tOther.next;
            }
        }
    
        // If we never got a good result that means we've eaten everything.
        if (!result.isGood) result = new CompareResult(tAny.next, null, true);
    
        return result;
    }
    
    function charCompare(tChar, tOther) {
        if (!tOther) return CompareResult.BadResult;
    
        switch (tOther.type) {
            case TokenType.Char: return charCharCompare(tChar, tOther); 
            case TokenType.Single: return new CompareResult(tChar.next, tOther.next);
            case TokenType.Set: return setCharCompare(tOther, tChar).swapTokens(); 
            case TokenType.AnyString: return anyCompare(tOther, tChar).swapTokens();
        }
    }
    
    function singleCompare(tSingle, tOther) {
        if (!tOther) return CompareResult.BadResult;
    
        switch (tOther.type) {
            case TokenType.Char: return new CompareResult(tSingle.next, tOther.next);
            case TokenType.Single: return new CompareResult(tSingle.next, tOther.next);
            case TokenType.Set: return new CompareResult(tSingle.next, tOther.next);
            case TokenType.AnyString: return anyCompare(tOther, tSingle).swapTokens();
        }
    }
    function setCompare(tSet, tOther) {
        if (!tOther) return CompareResult.BadResult;
    
        switch (tOther.type) {
            case TokenType.Char: return setCharCompare(tSet, tOther);
            case TokenType.Single: return new CompareResult(tSet.next, tOther.next);
            case TokenType.Set: return setSetCompare(tSet, tOther);
            case TokenType.AnyString: return anyCompare(tOther, tSet).swapTokens();
        }
    }
    
    function anySingleCompare(tAny, tSingle) {
        var nextResult = (tAny.next) ? singleCompare(tSingle, tAny.next).swapTokens() :
            new CompareResult(tAny, tSingle.next);
        return (nextResult.isGood) ? nextResult: new CompareResult(tAny, tSingle.next);
    }
    
    function anyCharCompare(tAny, tChar) {
        var nextResult = (tAny.next) ? charCompare(tChar, tAny.next).swapTokens() :
            new CompareResult(tAny, tChar.next);
    
        return (nextResult.isGood) ? nextResult : new CompareResult(tAny, tChar.next);
    }
    
    function charCharCompare(litA, litB) {
        return (litA.val === litB.val) ?
            new CompareResult(litA.next, litB.next) : CompareResult.BadResult;
    }
    
    function setCharCompare(tSet, tChar) {
        return (tSet.val.indexOf(tChar.val) > -1) ?
            new CompareResult(tSet.next, tChar.next) : CompareResult.BadResult;
    }
    
    function setSetCompare(tSetA, tSetB) {
        var setA = tSetA.val,
            setB = tSetB.val;
    
        for (var i = 0, il = setA.length; i < il; i++) {
            if (setB.indexOf(setA.charAt(i)) > -1) return new CompareResult(tSetA.next, tSetB.next);
        }
        return CompareResult.BadResult;
    }
    

    jsFiddle

    Time Complexity

    Anything with the words "recursive backtracking" in it is at least O(N2).

    Maintainability and Readability

    I purposely broke out any branches into there own functions with a singular switch. Assitionally I used named constants when a one character string would suffice. Doing this made the code longer and more verbose, but I think it makes it easier to follow.

    Tests

    You can view all the tests in the Fiddle. You can view the comments in the Fiddle output to glean their purposes. Each token type was tested against each token type, but I haven't made one that tried all possible comparisons in a single test. I also came up with a few random tough ones like the one below.

    abc[def]?fghi?*nop*[tuv]uv[wxy]?yz => a?[cde]defg*?ilmn[opq]*tu*[xyz]*

    I added an interface on the jsFiddle if anybody wants to test this out themselves. The logging is broken once I added the recursion.

    I don't think I tried enough negative tests, especially with the last version I created.

    Optimization

    Currently the solution is a brute force one, but is sufficient to handle any case. I would like to come back to this at some point to improve the time complexity with some simple optimizations.

    Checks at the start to reduce comparisons could increase processing time for certain common scenarios. For example, if one pattern starts with a star and one ends with one then we already know they will intersect. I can also check all the characters from the start and end of the patterns and remove them if the match on both patterns. This way they are excluded from any future recursion.

    Acknowledgements

    I used @m.buettner's tests initially to test my code before I came up with my own. Also I walked through his code to help me understand the problem better.