How to declare and reuse a character class in flex lexer?


Normally, when you want to reuse a regular expression in flex, you can declare it in the definitions section. When a definition is expanded, it is enclosed in parentheses by default. For example:

num_seq [0-9]+

%%

{num_seq} return INT;  // will become ([0-9]+)

{num_seq}\.{num_seq} return FLOAT;  // will become ([0-9]+)\.([0-9]+)

But I want to reuse some character classes. Can I define custom classes like [:alpha:], [:alnum:], etc.? A toy example:

chars [a-zA-Z]

%%

  // will become (([a-zA-Z]){-}[aeiouAEIOU])+  // ill-formed
  // desired ([a-zA-Z]{-}[aeiouAEIOU])+  // correct
({chars}{-}[aeiouAEIOU])+ return ONLY_CONS;

({chars}{-}[a-z])+ return ONLY_UPPER;

({chars}{-}[A-Z])+ return ONLY_LOWER;

Currently, this fails to compile because of the parentheses added around the expanded definition. Is there a proper way, or at least a workaround, to achieve this?


Solution

  • This would be useful from time to time, but unfortunately it has never been implemented in flex. You could suppress the automatic parentheses around macro substitution by running flex in lex compatibility mode (the -l option), but that has other, probably undesirable, effects.

    POSIX requires that regular expression bracket syntax include, in addition to the predefined character classes,

    …character class expressions of the form: [:name:] … in those locales where the name keyword has been given a charclass definition in the LC_CTYPE category.

    Unfortunately, flex does not implement this requirement. It is not too difficult to patch flex to do this, but there is no portable mechanism that lets users add charclasses to their locale, and many standard C library implementations lack proper locale support, so there is little incentive to make this change.

    Having looked at all these options, I eventually convinced myself that the simplest portable solution is to preprocess the flex input file, replacing each [:name:] with a set of characters based on name. Since that sequence of characters is unlikely to appear anywhere else in a flex input file, a simple-minded search and replace using sed or Python is adequate; correctly parsing the flex input file seems to me to be more trouble than it is worth.
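The preprocessing step described above could look roughly like the following Python sketch. The class names (cons, vowel), their contents, and the function name expand_custom_classes are all assumptions made for illustration; nothing here is part of flex. Names that are not in the table, including the standard ones such as [:alpha:], are left untouched so that flex still handles them itself.

```python
import re

# Hypothetical custom classes: the names and contents are illustrative only.
# Each value is the text that goes *inside* a bracket expression.
CUSTOM_CLASSES = {
    "cons": "b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z",
    "vowel": "aeiouAEIOU",
}

# Matches [:name:] where name is a lowercase identifier.
# Negated forms such as [:^alpha:] do not match and are left alone.
CLASS_RE = re.compile(r"\[:([a-z_]+):\]")


def expand_custom_classes(text, classes=CUSTOM_CLASSES):
    """Replace [:name:] with its character set for every known custom name."""
    def substitute(match):
        name = match.group(1)
        # Unknown names (e.g. the standard [:alpha:]) are returned unchanged.
        return classes.get(name, match.group(0))
    return CLASS_RE.sub(substitute, text)


# Example: a rule written with a custom class, expanded before running flex.
print(expand_custom_classes("([[:cons:]])+ return ONLY_CONS;"))
```

In a build, the same function would be applied to the whole input file and the result handed to flex. Because this is a plain textual replacement, it would also rewrite a matching [:name:] inside a string or comment, which is the trade-off the answer accepts.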