Search code examples
regexflex-lexer

Using Regex to match multiple instances of a character in a string from a list but only one of the following characters


Suppose I have a set of characters {abcd} . And suppose I have a sequence of strings.

aabb
ddccddcc
aaaa
bbbb
dddd
cdddcd

I would like write a regular expression to match all strings which any one (and only one) character from this set is repeated four times in succession, we match this string.

For example, aaaa would be true. However aabb would not meet this criteria, as though it uses a character in the set for times successively, it is not the same four character. Additionally, cdddcd would not meet this criteria as the four d's are seperated by a c character. Therefore, we must rule out using [abcd]{4} .

Though I could do the following:

a{4}    { return A;}
b{4}    { return B;}
c{4}    { return C;}
d{4}    { return D;}

However, I would like to develop my ability to write regular expressions, and I am sure there must be a way using expressing some logic to perform the task I would like using fewer statements. Thanks for reading! :)


Solution

  • There is no way to write that in a Flex regular expression other than writing it out as you did.

    Many so-called "regex" libraries do provide a way to do it, commonly called "back-references". These libraries are usually supercharged with a variety of such features which can result in regex matches taking quadratic or even exponential time (although they can be very convenient).

    However, flex does not implement these features. Flex regular expressions really are regular (that is, they correspond to the mathematical model), and as a consequence can be compiled into state machines whose execution time is guaranteed linear and generally very rapid. In the expect use case for Flex -- compilers -- the tokens rarely require complex patterns and the performance guarantees are important.

    (This is not quite the full story but it's a good approximation.)