Search code examples
regexcompiler-constructionflex-lexer

What's the regular expression for an alphabet without the first occurrence of a letter?


I am trying to use FLEX to recognize some regular expressions that I need. What I am looking for is given a set of characters, say [A-Z], I want a regular expression that can match the first letter no matter what it is, followed by a second letter that can be anything in [A-Z] besides the first letter.

For example, if I give you AB, you match it but if I give you AA you don't. So I am kind of looking for a regex that's something like [A-Z][A-Z^Besides what was picked in the first set].

How could this be implemented for more occurrences of letters? Say if I want to match 3 letters without each new letter being anything from the previous ones. For instance ABC but not AAB.

Thank you!


Solution

  • (Mathematical) regular expressions have no context. In (f)lex -- where regular expressions are actually regular, unlike most regex libraries -- there is no such thing as a back-reference, positive or negative.

    So the only way to accomplish your goal with flex patterns is to enumerate the possibilities, which is tedious for two letters and impractical for more. The two letter case would be something like (abbreviated);

    A[B-Z]|B[AC-Z]|C[ABD-Z]|D[A-CE-Z]|…|Z[A-Y]
    

    The inverse expression also has 26 cases but is easier to type (and read). You could use (f)lex's first-longest-match rule to make use of it:

    AA|BB|CC|DD|…|ZZ    { /* Two identical letters */ }
    [[:upper:]]{2}  { /* This is the match */ }
    

    Probably, neither of those is the best solution. However, I don't think I can give better advice without knowing more specifics. The key is knowing what action you want to take if the letters do match, which you don't specify. And what the other patterns are. (Recall that a lexical scanner is intended to divide the input into tokens, although you are free to ignore a token once it is identified.)

    Flex does come with a number of useful features which can be used for more flexible token handling, including yyless (to rescan part or all of the token), yymore (to combine the match with the next token), and unput (to insert a character into the input stream). There is also REJECT, but you should try other solutions first. See the flex manual chapter on actions for more details.

    So the simplest solution might be to just match any two capital letters, and then in the action check whether or not they are the same.