Flex dropping predefined macro


I have the following flex source:

%{
#if !defined(__linux__) && !defined(__unix__)
/* Maybe on windows */
#endif
int num_chars = 0;
%}

%%
.       ++num_chars;
%%
int main()
{
  yylex();
  printf("%d chars\n", num_chars);
  return 0;
}

int yywrap()
{
  return 1;
}

I generate a C file by the command flex flextest.l and compile the result with gcc -o fltest lex.yy.c

To my surprise, I get the following output:

flextest.l:2:37: error: operator "defined" requires an identifier
  #if !defined(__linux__) && !defined(__unix__)

After further checking, the issue seems to be that flex has actually replaced __unix__ with the empty string, as shown by:

$ grep __linux_ lex.yy.c
#if !defined(__linux__) && !defined()

Why does this happen, and is it possible to avoid it?


Solution

  • It's actually m4 (the macro processor used by current versions of flex) which is expanding __unix__ to the empty string. The GNU implementation of m4 defines certain symbols as empty macros so that they can be tested with ifdef.

    Of course, it's (better said, it was) a bug in flex. Flex shouldn't allow m4 to expand macros within user content copied from the scanner definition file. The current version of flex correctly quotes the text included from the scanner description file, so that it passes through m4 unmodified even if it happens to include a string which m4 could interpret as a macro invocation.

    The bug is certainly present in v2.5.39 and v2.6.1 of flex. I didn't test all previous versions, but I suppose it was introduced when flex was modified to use m4, which was v2.5.30 according to the NEWS file.

    This particular quoting issue was fixed in v2.6.2, but the current version of flex (2.6.4) contains various other bug fixes, so I'd recommend you upgrade to the latest version.
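
    The version boundaries above can be captured in a small C helper. This is only a sketch: the names pack_version and flex_expands_user_macros are hypothetical, the lower bound 2.5.30 is the supposition stated above rather than a tested fact, and the comment about the YY_FLEX_*_VERSION macros assumes the generated scanner defines them.

```c
/* Sketch: classify a flex version by whether user code is expected to pass
   through m4 unquoted. The affected range [2.5.30, 2.6.2) comes from the
   text above; the lower bound is a supposition, not a tested fact. */

/* Pack a three-part version into one comparable integer (each part < 100). */
static int pack_version(int major, int minor, int subminor)
{
    return major * 10000 + minor * 100 + subminor;
}

/* Hypothetical helper: returns 1 if this flex version is expected to let m4
   expand macros inside user code, 0 otherwise. In a generated scanner the
   arguments could come from YY_FLEX_MAJOR_VERSION, YY_FLEX_MINOR_VERSION and
   YY_FLEX_SUBMINOR_VERSION (assumed to be defined there). */
int flex_expands_user_macros(int major, int minor, int subminor)
{
    int v = pack_version(major, minor, subminor);
    return v >= pack_version(2, 5, 30) && v < pack_version(2, 6, 2);
}
```

    With these bounds, 2.5.39 and 2.6.1 are classified as affected, while 2.5.4, 2.6.2 and 2.6.4 are not.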


    If you really need a version which could work with both the buggy and the more recent versions of flex, you could use one of the two following hacks:

    1. Find some other way to write __unix__. One possibility is the following:

      #define C(x,y) x##y
      #define UNIX_ C(__un,ix__)
      #if !defined(__linux__) && !UNIX_
      

      That hack won't work with defined, since defined(UNIX_) tests whether UNIX_ is defined, not whether what it expands to is defined. But built-in symbols like __unix__ are normally defined to be 1 if they are defined at all, and the #if directive treats any identifier which is not #define'd as though it were 0, which means that you can usually use x instead of defined(x). (However, it will produce different results if there were a #define x 0 in effect, so it's not quite a perfect substitute.)

    2. Flex, like many m4 applications, redefines m4's quote marks to be [[ and ]]. Both the buggy flex and the corrected versions substitute these quote marks with a rather elaborate sequence which effectively quotes the quote marks. However, the buggy version does not otherwise quote user-defined text, so macro substitutions will be performed in user text. (As mentioned, this is why __unix__ becomes the empty string.)

      In flex versions in which user-defined text is not quoted, it is possible to invoke the m4 macro which redefines quote marks. These new quote marks can then be used to quote the #if line, preventing macro substitution of __unix__. However, the quote definition must be restored, or it will completely wreck macro processing of the rest of the file. That's a bit tricky because it is impossible to write [[. (Flex will substitute it with a different string.)

      The following seems to do the trick. Note that the macro invocations are placed inside C comments. The changequote macros will expand to an empty string, if they are expanded. But in flex versions since v2.6.2, user-supplied text is quoted, so the changequote macros will not be expanded. Putting them inside comments hides them from the C compiler.

      %{
      /*m4_changequote(<<,>>)<<*/
      #if !defined(__linux__) && !defined(__unix__)
      /*>>m4_changequote(<<[>><<[>>,<<]>><<]>>)*/
      
      /* Maybe on windows */
      #endif
      

      (The m4 macro which changes quote marks is changequote but flex invokes m4 with the -P flag which changes builtins like changequote to m4_changequote. In the second call to changequote, the two [ which make up the [[ sign are individually quoted with the temporary << quote marks, which hides them from the code in flex which modifies use of [[.)

      I don't know how reliable this hack is, but it worked on the versions of flex which I had kicking around on my machine, including 2.5.4 (pre-m4), 2.5.39 (buggy), 2.6.1 (buggy), 2.6.2 (somewhat debugged) and 2.6.4 (more debugged).
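
    The first hack, together with its caveat about a symbol defined as 0, can be checked in an ordinary C file that never goes through flex. The sketch below is illustrative only: PASTE, ZERO_FLAG and the four function names are made up for this example, and it assumes a compiler that predefines the Unix symbol to 1 where it defines it at all, as gcc does on Linux.

```c
/* Plain C, not run through flex: a sketch of the token-pasting workaround. */

/* Build the name from two halves so that the full spelling never appears as
   a single word in the source. PASTE is a hypothetical helper name. */
#define PASTE(x, y) x##y
#define UNIX_ PASTE(__un, ix__)

/* Hypothetical symbol that is defined, but with the value 0. */
#define ZERO_FLAG 0

/* Value test through the pasted name: 1 where the compiler predefines the
   Unix symbol to a non-zero value, 0 where it is not defined at all. */
int unix_by_value(void)
{
#if UNIX_
    return 1;
#else
    return 0;
#endif
}

/* defined() looks only at the wrapper macro itself, so this returns 1 on
   every platform and says nothing about the Unix symbol. */
int wrapper_is_defined(void)
{
#if defined(UNIX_)
    return 1;
#else
    return 0;
#endif
}

/* The caveat: a symbol defined as 0 passes a defined() test ... */
int zero_flag_is_defined(void)
{
#if defined(ZERO_FLAG)
    return 1;
#else
    return 0;
#endif
}

/* ... but fails a plain value test. */
int zero_flag_by_value(void)
{
#if ZERO_FLAG
    return 1;
#else
    return 0;
#endif
}
```

    The two ZERO_FLAG functions disagree (1 versus 0), which is exactly the case where replacing defined(x) with x changes the result.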