c++ visual-c++ g++ c++14 clang++

Why does this simple code not consistently compile?


The following code compiles on g++, clang, and Visual Studio:

#define HEX(hex_)  0x##hex_
int main()
{
    return HEX(BadC0de);
}

as does this modification, using C++14 digit separators:

    return HEX(1'Bad'C0de);

But this won't compile on g++ or clang (it works on Visual Studio):

#define HEX(hex_)  0x##hex_
int main()
{
    return HEX(A'Bad'C0de);
}

g++ output:

<source>:4:1: warning: multi-character character constant [-Wmultichar]
    4 |     return HEX(A'Bad'C0de);
      | ^  
<source>: In function 'int main()':
<source>:4:17: error: expected ';' before user-defined character literal
    4 |     return HEX(A'Bad'C0de);
      |                 ^~~~~~~~~
<source>:1:25: note: in definition of macro 'HEX'
    1 | #define HEX(hex_)   0x##hex_
      |                         ^~~~
<source>:4:17: error: unable to find character literal operator 'operator""C0de' with 'int' argument
    4 |     return HEX(A'Bad'C0de);
      |                 ^~~~~~~~~
<source>:1:25: note: in definition of macro 'HEX'
    1 | #define HEX(hex_)   0x##hex_
      |                         ^~~~

UPDATE: interestingly, the preprocessor output for this is

    return 0xA'Bad'C0de;

which does compile, so obviously the standalone preprocessor is working differently here than the unified preprocessor.

This also fails on g++/clang, but with different errors:

    return HEX(Bad'C0de);

g++ output:

<source>:4:19: warning: missing terminating ' character
    4 |     return HEX(Bad'C0de);
      |                   ^
<source>:5:2: error: unterminated argument list invoking macro "HEX"
    5 | }
      |  ^
<source>: In function 'int main()':
<source>:4:12: error: 'HEX' was not declared in this scope
    4 |     return HEX(Bad'C0de);
      |            ^~~
<source>:4:15: error: expected ';' at end of input
    4 |     return HEX(Bad'C0de);
      |               ^
      |               ;
<source>:4:15: error: expected '}' at end of input
<source>:3:1: note: to match this '{'
    3 | {
      | ^

Update: preprocessor stops before parsing the HEX() argument in this case.

I'd like to believe this is a g++ bug, but given how badly noncompliant Visual Studio's preprocessor has historically been, perhaps that is wishful thinking. And in fact, that last program not only fails on g++, it also triggers an internal compiler error on Visual Studio (at least on godbolt.org)!

msvc output:

<source>(4): error C2001: newline in constant
<source>(4): fatal error C1057: unexpected end of file in macro expansion
Internal Compiler Error in Z:\opt\compiler-explorer\windows\19.00.24210\bin\amd64\cl.exe.  You will be prompted to send an error report to Microsoft later.
INTERNAL COMPILER ERROR in 'Z:\opt\compiler-explorer\windows\19.00.24210\bin\amd64\cl.exe'
    Please choose the Technical Support command on the Visual C++
    Help menu, or open the Technical Support help file for more information

Naively, I would have expected all the compilers to just pass all text to the macro substitution before trying to interpret its meaning (it is a PRE-processor after all!); only after the ## concatenation would I expect the token to be examined for meaning. (Yes, I know that some basic parsing happens to match parentheses, brackets, etc. so that commas within them don't split arguments, but I would not expect that to extend to any other language constructs.)

Does the standard have anything to say about these programs? Are they somehow non-conformant, or are they legal and the compilers are buggy?


Solution

  • This is one of those nasty holes in the spec. The preprocessor is defined (in the spec) in terms of "preprocessing tokens". The input is first split into a sequence of preprocessing tokens and then macro processing happens on that sequence.

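    Now the problem comes from the fact that 0xA'Bad'C0de is a single preprocessing token, but A'Bad'C0de is not: it is split into several preprocessing tokens (A, then 'Bad' and C0de; C++11 and later lex 'Bad'C0de together as one user-defined character literal, which is why g++ complains about operator""C0de), and the token paste operator ## is defined to paste just two adjacent tokens, so only 0x and A get joined. Getting the result you expect would require the tokenization phase to depend on what macros have been defined and what they might do.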

    Fixing this would require non-trivial spec changes, and require tracking directly-adjacent preprocessing tokens vs non-directly-adjacent tokens (those that have whitespace or comments between them) and having the ## operator potentially paste additional directly-adjacent tokens when that makes sense.

    This would still have problems with things like HEX(A'B) -- how would you tell when the ) should be part of a multichar character constant token vs ending the macro argument list?
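
If the goal is just a hex constant with separators that may start with a letter, one way to sidestep the token-pasting problem is to keep the digits inside a string literal, which is always a single preprocessing token, and parse it at compile time. This is only a minimal sketch, not something the standard or any compiler provides: hex_digit and parse_hex are hypothetical names, the loop assumes C++14 constexpr rules, and overflow is not checked.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: numeric value of a single hex digit.
// Reaching the throw during constant evaluation is a compile-time error.
constexpr std::uint64_t hex_digit(char c)
{
    return (c >= '0' && c <= '9') ? static_cast<std::uint64_t>(c - '0')
         : (c >= 'a' && c <= 'f') ? static_cast<std::uint64_t>(c - 'a' + 10)
         : (c >= 'A' && c <= 'F') ? static_cast<std::uint64_t>(c - 'A' + 10)
         : throw std::invalid_argument("not a hex digit");
}

// Hypothetical helper: parse hex digits from a string, skipping ' separators.
// The loop requires C++14 (relaxed constexpr). No overflow check is done.
constexpr std::uint64_t parse_hex(const char* s)
{
    std::uint64_t value = 0;
    for (; *s != '\0'; ++s) {
        if (*s == '\'') {
            continue;  // digit separator, ignored
        }
        value = value * 16 + hex_digit(*s);
    }
    return value;
}

// The spellings that break HEX(...) are fine here, because the preprocessor
// only ever sees one string-literal token.
static_assert(parse_hex("A'Bad'C0de") == 0xABadC0de, "leading letter, two separators");
static_assert(parse_hex("Bad'C0de") == 0xBadC0de, "single separator");
```

The trade-off is that the call site becomes parse_hex("A'Bad'C0de") rather than a macro taking bare tokens, but nothing then depends on how the argument happens to be split into preprocessing tokens.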