Search code examples
regexparsingsequences

Matching token sequences


I have a set of n tokens (e.g., a, b, c) distributed among a bunch of other tokens. I would like to know if all members of my set occur within a given number of positions (window size). It occurred to me that it may be possible to write a RegEx to capture this state, but the exact syntax eludes me.

          11111
012345678901234
ab ab bc  a cba

In this example, given window size=5, I would like to match cba at positions 12-14, and abc in positions 3-7.

Is there a way to do this with RegEx, or is there some other kind of grammar that I can use to capture this logic?

I am hoping to implement this in Java.


Solution

  • Here's a regex that matches 5-letter sequences that include all of 'a', 'b' and 'c':

    (?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}
    

    So, while basically matching any 5 characters (with .{5}), there are three preconditions the matches have to observe. Each of them requires one of the tokens/letters to be present (up to 4 characters followed by 'a', etc.). (?=X) matches "X, with a zero-width positive look-ahead", where zero-width means that the character position is not moved while matching.

    Doing this with regexes is slow, though.. Here's a more direct version (seems about 15x faster than using regular expressions):

    public static void find(String haystack, String tokens, int windowLen) {
        char[] tokenChars = tokens.toCharArray();
        int hayLen = haystack.length();
    
        int pos = 0;
        nextPos:
        while (pos + windowLen <= hayLen) {
            for (char c : tokenChars) {
                int i = haystack.indexOf(c, pos);
                if (i < 0) return;
    
                if (i - pos >= windowLen) {
                    pos = i - windowLen + 1;
                    continue nextPos;
                }
            }
    
            // match found at pos
            System.out.println(pos + ".." + (pos + windowLen - 1) + ": " + haystack.substring(pos, pos + windowLen));
            pos++;
        }
    }