Search code examples
regexperl

Regex to match substrings containing n non-repeated characters


I am facing a (naive) problem with regular expression. I need to find any substrings composed of a fixed number (n) of different characters.

So, for "aaabcddd", if n=3 the substrings that I expect to find are: "abc" and "bcd".

My idea is to use n-1 capture groups and '[^' to exclude characters already matched. Thus, I wrote the following Perl regex (in Julia):

r"(([[:alpha:]])[^\2])[^\1]"

But, it is not working.

Do you have any tips?


Solution

  • You can not use a backreference to a capture group using a negated character class [^\1]

    What you can do is use a negative lookahead to assert what is directly to the right of the current position is not what you have already captured in a previous group.

    If that is the case, capture a single alpha in a new group.

    The matches abc and bcd are in capture group 1

    (?=(([[:alpha:]])(?!\2)([[:alpha:]])(?!\3|\2)[[:alpha:]]))
    
    • (?= Positive lookahead
      • ( Capture group 1
        • ([[:alpha:]]) Capture the first char in group 2
        • (?!\1)([[:alpha:]]) If not looking at what is captured by group 2 to the right, capture the second char in group 3
        • (?!\2|\1) If not looking to the right at what is captured by group 2 or 3
        • [[:alpha:]] Mach the 3rd char
      • ) Close group 1
    • ) Close the lookahead

    Regex demo

    Or a bit shorter using a case insensitive match:

    (?=(([a-z])(?!\2)([a-z])(?!\3|\2)[a-z]))