Regex split into overlapping strings


I'm exploring the power of regular expressions, so I'm just wondering if something like this is possible:

public class StringSplit {
    public static void main(String[] args) {
        System.out.println(
            java.util.Arrays.deepToString(
                "12345".split(INSERT_REGEX_HERE)
            )
        ); // prints "[12, 23, 34, 45]"
    }
}

If it's possible, then simply provide the regex (and, preemptively, some explanation of how it works).

If it's only possible in some regex flavors other than Java, then feel free to provide those as well.

If it's not possible, then please explain why.


BONUS QUESTION

Same question, but with a find() loop instead of split:

    Matcher m = Pattern.compile(BONUS_REGEX).matcher("12345");
    while (m.find()) {
        System.out.println(m.group());
    } // prints "12", "23", "34", "45"

Please note that it's not so much that I have a concrete task to accomplish one way or another, but rather I want to understand regular expressions. I don't need code that does what I want; I want regexes, if they exist, that I can use in the above code to accomplish the task (or regexes in other flavors that work with a "direct translation" of the code into another language).

And if they don't exist, I'd like a good solid explanation why.


Solution

  • I don't think this is possible with split(), but with find() it's pretty simple. Just use a lookahead with a capturing group inside:

    Matcher m = Pattern.compile("(?=(\\d\\d)).").matcher("12345");
    while (m.find())
    {
      System.out.println(m.group(1));
    }
    

    Many people don't realize that text captured inside a lookahead or lookbehind can be referenced after the match just like any other capture. It's especially counter-intuitive in this case, where the capture is a superset of the "whole" match.
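
To make the "capture is longer than the match" point concrete, here is a minimal sketch that records both the whole match and the captured group for each hit. The class and method names (`LookaheadCaptureDemo`, `wholeAndCapture`) are made up for illustration; only the regex comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadCaptureDemo {
    // Hypothetical helper: returns "whole -> capture" for every match.
    public static List<String> wholeAndCapture(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("(?=(\\d\\d)).").matcher(input);
        while (m.find()) {
            // group() is the single character consumed by the dot;
            // group(1) is the two-digit text seen by the lookahead.
            out.add(m.group() + " -> " + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints "1 -> 12", "2 -> 23", "3 -> 34", "4 -> 45", one per line
        for (String line : wholeAndCapture("12345")) {
            System.out.println(line);
        }
    }
}
```

Each whole match is one character long, yet the group it carries extends one character past it.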

    As a matter of fact, it works even if the overall match is empty, i.e. the regex consumes no characters at all. Remove the dot from the regex above (leaving "(?=(\\d\\d))") and you'll get the same result. This is because, whenever a successful match consumes no characters, the regex engine automatically bumps ahead one position before trying to match again, to prevent infinite loops.
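
The bump-along behavior can be observed directly by recording where each zero-width match starts. This is a sketch under the same assumptions as the answer's snippet; the class and method names (`ZeroWidthFindDemo`, `pairsWithOffsets`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroWidthFindDemo {
    // Hypothetical helper: returns "offset:pair" for every zero-width match.
    public static List<String> pairsWithOffsets(String input) {
        List<String> out = new ArrayList<>();
        // No dot this time: the pattern is a lookahead only, so every match is empty.
        Matcher m = Pattern.compile("(?=(\\d\\d))").matcher(input);
        while (m.find()) {
            // For an empty match, start() and end() are the same offset.
            out.add(m.start() + ":" + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairsWithOffsets("12345")); // prints "[0:12, 1:23, 2:34, 3:45]"
    }
}
```

The offsets advance by one each time even though nothing is consumed, which is the engine stepping forward after each empty match.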

    There's no split() equivalent for this technique, though, at least not in Java. Although you can split on lookarounds and other zero-width assertions, there's no way to get the same character to appear in more than one of the resulting substrings.
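
For comparison, here is a small sketch of what splitting on lookarounds does give you. The pattern `(?<=\\d)(?=\\d)` is my own choice for illustration, not something from the question: it matches the zero-width boundary between any two digits.

```java
import java.util.Arrays;

public class LookaroundSplitDemo {
    // Hypothetical helper: splits at every zero-width boundary between two digits.
    public static String[] splitBetweenDigits(String input) {
        return input.split("(?<=\\d)(?=\\d)");
    }

    public static void main(String[] args) {
        String[] parts = splitBetweenDigits("12345");
        System.out.println(Arrays.toString(parts)); // prints "[1, 2, 3, 4, 5]"
        // Gluing the pieces back together reproduces the input,
        // so no character was handed out twice.
        System.out.println(String.join("", parts).equals("12345")); // prints "true"
    }
}
```

The pieces always tile the input without overlap, which is why "[12, 23, 34, 45]" is out of reach for split().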