The following code compiles on g++, clang, and Visual Studio:
#define HEX(hex_) 0x##hex_
int main()
{
    return HEX(BadC0de);
}
as does this modification, using C++14 digit separators:
return HEX(1'Bad'C0de);
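As a quick sanity check, here is a minimal sketch (assuming a C++14 or later compiler; the constant names are made up for illustration) showing that the two spellings that do compile expand to the same value:

```cpp
#define HEX(hex_) 0x##hex_

// Both spellings expand to the same hexadecimal literal.
// with_separators / without_separators are illustrative names only.
constexpr int with_separators    = HEX(1'Bad'C0de);
constexpr int without_separators = HEX(1BadC0de);

static_assert(with_separators == without_separators,
              "digit separators do not change the value");
static_assert(with_separators == 0x1BADC0DE,
              "same as writing the literal directly");
```

The checks run at compile time, so the snippet only needs to compile; no output is expected.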
But this won't compile on g++ or clang (it works on Visual Studio):
#define HEX(hex_) 0x##hex_
int main()
{
    return HEX(A'Bad'C0de);
}
g++ output:
<source>:4:1: warning: multi-character character constant [-Wmultichar]
4 | return HEX(A'Bad'C0de);
| ^
<source>: In function 'int main()':
<source>:4:17: error: expected ';' before user-defined character literal
4 | return HEX(A'Bad'C0de);
| ^~~~~~~~~
<source>:1:25: note: in definition of macro 'HEX'
1 | #define HEX(hex_) 0x##hex_
| ^~~~
<source>:4:17: error: unable to find character literal operator 'operator""C0de' with 'int' argument
4 | return HEX(A'Bad'C0de);
| ^~~~~~~~~
<source>:1:25: note: in definition of macro 'HEX'
1 | #define HEX(hex_) 0x##hex_
| ^~~~
UPDATE: interestingly, the preprocessor output for this is
return 0xA'Bad'C0de;
which does compile, so the standalone preprocessor evidently behaves differently here from the one integrated into the compiler.
This also fails on g++/clang, but with different errors:
return HEX(Bad'C0de);
g++ output:
<source>:4:19: warning: missing terminating ' character
4 | return HEX(Bad'C0de);
| ^
<source>:5:2: error: unterminated argument list invoking macro "HEX"
5 | }
| ^
<source>: In function 'int main()':
<source>:4:12: error: 'HEX' was not declared in this scope
4 | return HEX(Bad'C0de);
| ^~~
<source>:4:15: error: expected ';' at end of input
4 | return HEX(Bad'C0de);
| ^
| ;
<source>:4:15: error: expected '}' at end of input
<source>:3:1: note: to match this '{'
3 | {
| ^
Update: in this case, the preprocessor stops before it has parsed the HEX() argument.
I'd like to believe this is a g++ bug, but given how badly noncompliant Visual Studio's preprocessor has historically been, perhaps that is wishful thinking. And in fact, that last program not only fails on g++, it also triggers an internal compiler error on Visual Studio (at least on godbolt.org)!
msvc output:
<source>(4): error C2001: newline in constant
<source>(4): fatal error C1057: unexpected end of file in macro expansion
Internal Compiler Error in Z:\opt\compiler-explorer\windows\19.00.24210\bin\amd64\cl.exe. You will be prompted to send an error report to Microsoft later.
INTERNAL COMPILER ERROR in 'Z:\opt\compiler-explorer\windows\19.00.24210\bin\amd64\cl.exe'
Please choose the Technical Support command on the Visual C++
Help menu, or open the Technical Support help file for more information
Naively, I would have expected all the compilers to just pass all the text to the macro substitution before trying to interpret its meaning (it is a PRE-processor after all!); only after the ## concatenation would I expect the token to be examined for meaning. (Yes, I know that some basic parsing happens to match parentheses, brackets, etc. so that commas within them don't split arguments, but I would not expect that to extend to any other language constructs.)
Does the standard have anything to say about these programs? Are they somehow non-conformant, or are they legal and the compilers are buggy?
This is one of those nasty holes in the spec. The preprocessor is defined (in the spec) in terms of "preprocessing tokens". The input is first split into a sequence of preprocessing tokens and then macro processing happens on that sequence.
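Now the problem comes from the fact that 0xA'Bad'C0de is a single preprocessing token (a pp-number), but A'Bad'C0de is not. It is split into several preprocessing tokens: the identifier A, followed by the character literal 'Bad' and C0de (which g++'s diagnostics show being read as a user-defined-literal suffix on 'Bad' rather than as a separate identifier). The token paste operator ## is defined to paste just the two tokens adjacent to it, so 0x is pasted onto A alone and the rest is left behind. For this to work the way you expect, the tokenization phase would have to depend on what macros have been defined and what they might do.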
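To make the tokenization difference concrete, here is a rough sketch of the pp-number rule as a compile-time scanner. It is a simplification written for illustration, not code from any compiler: the names pp_number_length, is_digit and is_nondigit are hypothetical, universal-character-names are ignored, and it assumes C++14 or later.

```cpp
#include <cstddef>

constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_nondigit(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

// Length of the longest pp-number at the start of s, or 0 if s does not
// begin with one. A pp-number must start with a digit, or '.' then a digit.
constexpr std::size_t pp_number_length(const char* s) {
    std::size_t i = 0;
    if (is_digit(s[0])) {
        i = 1;
    } else if (s[0] == '.' && is_digit(s[1])) {
        i = 2;
    } else {
        return 0;                       // e.g. starts with a letter
    }
    bool exp_letter = false;            // last char was e/E/p/P (sign may follow)
    while (s[i] != '\0') {
        const char c = s[i];
        if (is_digit(c) || is_nondigit(c) || c == '.') {
            exp_letter = (c == 'e' || c == 'E' || c == 'p' || c == 'P');
            ++i;
        } else if ((c == '+' || c == '-') && exp_letter) {
            exp_letter = false;
            ++i;
        } else if (c == '\'' && (is_digit(s[i + 1]) || is_nondigit(s[i + 1]))) {
            exp_letter = false;         // digit separator plus one following char
            i += 2;
        } else {
            break;
        }
    }
    return i;
}

// The whole literal is one pp-number...
static_assert(pp_number_length("0xA'Bad'C0de") == 12, "single token");
// ...and so is the argument that starts with a digit...
static_assert(pp_number_length("1'Bad'C0de") == 10, "single token");
// ...but an argument starting with a letter is not a pp-number at all.
static_assert(pp_number_length("A'Bad'C0de") == 0, "not a pp-number");
```

Under this rule the quote only counts as a digit separator once the scanner is already inside a pp-number, which is why HEX(1'Bad'C0de) survives as one argument token while HEX(A'Bad'C0de) does not.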
Fixing this would require non-trivial spec changes: the preprocessor would have to track which preprocessing tokens are directly adjacent and which are not (those that have whitespace or comments between them), and the ## operator would have to paste additional directly adjacent tokens when that makes sense.
This would still have problems with things like HEX(A'B): how would you tell whether the ) should be part of a multi-character character constant token or should end the macro argument list?
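The ambiguity can be illustrated with another simplified sketch, again not taken from any compiler, of what a scanner might do once it meets a quote that is not inside a pp-number: it looks for a closing quote on the same line, and everything in between, including a ), belongs to the literal. The name char_literal_length is hypothetical, and the helper does not reject an empty literal or handle line splicing.

```cpp
#include <cstddef>

// Given s pointing at an opening single quote, return the length of the
// character literal up to and including its closing quote, or 0 if the
// line ends before a closing quote is found.
constexpr std::size_t char_literal_length(const char* s) {
    if (s[0] != '\'') return 0;
    std::size_t i = 1;
    while (s[i] != '\0' && s[i] != '\n') {
        if (s[i] == '\\' && s[i + 1] != '\0') {
            i += 2;                 // skip an escaped character such as \'
        } else if (s[i] == '\'') {
            return i + 1;           // closing quote found
        } else {
            ++i;                    // ordinary character, ')' included
        }
    }
    return 0;                       // unterminated on this line
}

// In HEX(A'Bad'C0de) the quotes pair up, so 'Bad' is a complete literal.
static_assert(char_literal_length("'Bad'C0de);") == 5, "closed literal");
// In HEX(Bad'C0de) and HEX(A'B) there is no closing quote, so the ')'
// is swallowed while searching for one and the argument list never ends.
static_assert(char_literal_length("'C0de);") == 0, "unterminated");
static_assert(char_literal_length("'B)") == 0, "unterminated");
```

This matches the shape of the g++ diagnostics for HEX(Bad'C0de): first "missing terminating ' character", then "unterminated argument list invoking macro".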