
Regex matching a word containing a character exactly two times in a row


The problem

As stated in the title, I am looking for a regex that matches a word if and only if it contains a run of exactly two identical consecutive characters, that is, a doubled character that is neither preceded nor followed by that same character.

Test cases

  • Helo --> false
  • programming --> true
  • belllike --> false (since there are three ls)
  • shellless --> true (even though there are three ls, this input should match because of the two ss)

Things I've tried before

The regex [a-zA-Z]*([a-zA-Z])\1[a-zA-Z]* matches a word with at least two consecutive identical characters, but belllike would still match because there is no upper limit on the length of the run.
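To make the shortcoming concrete, here is a minimal Java sketch that runs this first attempt against the four test cases. The class and method names (AtLeastTwoDemo, hasRepeat) are made up for illustration and are not part of the assignment.

```java
import java.util.regex.Pattern;

class AtLeastTwoDemo {
    // First attempt from the question: a letter followed by itself at least once.
    // Nothing here limits the run to exactly two characters.
    static final Pattern AT_LEAST_TWO =
            Pattern.compile("[a-zA-Z]*([a-zA-Z])\\1[a-zA-Z]*");

    // Hypothetical helper: true if the whole word matches the pattern.
    static boolean hasRepeat(String word) {
        return AT_LEAST_TWO.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"Helo", "programming", "belllike", "shellless"};
        for (String w : words) {
            // belllike prints true here, which is the unwanted result.
            System.out.println(w + " -> " + hasRepeat(w));
        }
    }
}
```

Only Helo is rejected; belllike is accepted because the first two ls already satisfy ([a-zA-Z])\1.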

I also tried to use negative lookaheads and lookbehinds. For one letter, this may look like this:

[a-zA-Z]*(?<!a)aa(?!a)[a-zA-Z]*

This regex fulfills all requirements for the letter a, but neither I nor the people I asked could generalize it with capture groups so that it works for any letter. Copy-pasting this expression 26 times (once for each letter) and combining the copies with OR would probably work, but it is not the solution I am looking for.
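For reference, the brute-force variant does not have to be typed out by hand. The following is only a sketch of that workaround, assuming lowercase a-z is enough; the names PerLetterDemo, buildAlternation and hasExactPair are invented for this example.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class PerLetterDemo {
    // Builds (?<!a)aa(?!a)|(?<!b)bb(?!b)|...|(?<!z)zz(?!z).
    // Assumption: only lowercase letters a-z need to be covered.
    static String buildAlternation() {
        return IntStream.rangeClosed('a', 'z')
                .mapToObj(c -> String.valueOf((char) c))
                .map(c -> "(?<!" + c + ")" + c + c + "(?!" + c + ")")
                .collect(Collectors.joining("|"));
    }

    static final Pattern EXACT_PAIR = Pattern.compile(buildAlternation());

    // Hypothetical helper: true if some letter occurs exactly twice in a row.
    static boolean hasExactPair(String word) {
        // find() is used instead of wrapping the pattern in [a-zA-Z]* on both sides.
        return EXACT_PAIR.matcher(word).find();
    }

    public static void main(String[] args) {
        String[] words = {"Helo", "programming", "belllike", "shellless"};
        for (String w : words) {
            System.out.println(w + " -> " + hasExactPair(w));
        }
    }
}
```

This gives the desired result for all four test cases, but it only confirms that the per-letter approach works; it does not answer whether a single backreference-based pattern exists.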

What I'm looking for

A solution to the described problem would be great, of course. If it cannot be done with regex, I would be equally happy with an explanation of why it is not possible.

Background

This task was part of an assignment I had to do for uni. In a later conversation, the prof stated that they hadn't actually meant to ask that question and were fine with accepting sequences of three or more identical characters. However, the struggle of trying to find a solution made me curious whether this is actually possible with regex and, if so, how it could be done.

Regex flavor to use

Even though the initial task should be done in the Java 8+ regex flavour, I would be fine with a solution in any regex flavor that solves the described problem.


Solution

  • You can try:

    ^(?:.*?(.)(?!\1))?(.)\2(?!\2).*$
    


    • ^ - Start line anchor.
    • (?: - Open non-capture group:
      • .*? - 0+ chars other than newline (lazy), up to:
      • (.)(?!\1) - A first capture group of a single char other than newline, with a negative lookahead holding a backreference to this char to assert that it is not followed by the same char. Whatever char comes next therefore differs from it, so the pair matched next cannot be preceded by its own char.
      • )? - Close non-capture group and make it optional, so the pair may also sit at the very start of the line.
    • (.)\2(?!\2) - A second capture group of a single char other than newline, immediately followed by the same char through the backreference \2. The negative lookahead (?!\2) then asserts that the position is not followed by a third occurrence of that char.
    • .* - 0+ chars other than newline (greedy), up to:
    • $ - End line anchor.
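    Since the question mentions Java, here is a minimal sketch that checks the pattern against the test cases from the question plus a few edge cases. The class and method names (ExactlyTwoDemo, hasExactPair) are assumptions for illustration; note that backslashes have to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class ExactlyTwoDemo {
    // The pattern from this answer, with backslashes escaped for Java.
    static final Pattern EXACTLY_TWO =
            Pattern.compile("^(?:.*?(.)(?!\\1))?(.)\\2(?!\\2).*$");

    // Hypothetical helper: true if the word contains a char exactly twice in a row.
    static boolean hasExactPair(String word) {
        return EXACTLY_TWO.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasExactPair("Helo"));        // false
        System.out.println(hasExactPair("programming")); // true
        System.out.println(hasExactPair("belllike"));    // false
        System.out.println(hasExactPair("shellless"));   // true
        // Edge cases: pair at the start of the word, and a run of three only.
        System.out.println(hasExactPair("aab"));         // true
        System.out.println(hasExactPair("aaa"));         // false
    }
}
```

    Because . matches any char other than newline, this sketch also accepts doubled digits or punctuation; replacing . with [a-zA-Z] would restrict it to letters as in the question's own attempts.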
