Matching token sequences

I have a set of n tokens (e.g., a, b, c) distributed among a bunch of other tokens. I would like to know if all members of my set occur within a given number of positions (window size). It occurred to me that it may be possible to write a RegEx to capture this state, but the exact syntax eludes me.

          11111
012345678901234
ab ab bc  a cba

In this example, given window size=5, I would like to match cba at positions 12-14, and abc in positions 3-7.

Is there a way to do this with RegEx, or is there some other kind of grammar that I can use to capture this logic?

I am hoping to implement this in Java.

Solution

Here's a regex that matches 5-letter sequences that include all of 'a', 'b' and 'c':

(?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}

So, while basically matching any 5 characters (with .{5}), there are three preconditions the matches have to observe. Each of them requires one of the tokens/letters to be present (up to 4 characters followed by 'a', etc.). (?=X) matches "X, with a zero-width positive look-ahead", where zero-width means that the character position is not moved while matching.

Doing this with regexes is slow, though.. Here's a more direct version (seems about 15x faster than using regular expressions):

public static void find(String haystack, String tokens, int windowLen) {
    char[] tokenChars = tokens.toCharArray();
    int hayLen = haystack.length();

    int pos = 0;
    nextPos:
    while (pos + windowLen <= hayLen) {
        for (char c : tokenChars) {
            int i = haystack.indexOf(c, pos);
            if (i < 0) return;

            if (i - pos >= windowLen) {
                pos = i - windowLen + 1;
                continue nextPos;
            }
        }

        // match found at pos
        System.out.println(pos + ".." + (pos + windowLen - 1) + ": " + haystack.substring(pos, pos + windowLen));
        pos++;
    }
}