c++ c c-preprocessor c89

What are the definitions for valid and invalid pp-tokens?


I want to extensively use the ##-operator and enum magic to handle a huge bunch of similar access-operations, error handling and data flow.

If applying the ## and # preprocessor operators results in an invalid pp-token, the behavior is undefined in C.

The order of preprocessor operations in general is not defined (*) in C90 (see The token pasting operator). In some cases (according to several sources, including the MISRA Committee and the referenced page), the order in which multiple ##/# operators are applied determines whether undefined behavior occurs. But I have a really hard time understanding the examples in these sources and pinning down the common rule.

So my questions are:

  1. What are the rules for valid pp-tokens?

  2. Are there differences between the different C and C++ Standards?

  3. My current problem: Is the following legal with both operator orders?(**)

    #define test(A) test_## A ## _THING
    int test(0001) = 2;
    

Comments:

(*) I don't say "is undefined" because, IMHO, this has nothing to do with undefined behavior yet; it is unspecified behavior. Applying more than one ## or # operator does not in general make the program erroneous. There is obviously an order (we just can't predict which), so the order is unspecified.

(**) This is no actual application for the numbering. But the pattern is equivalent.


Solution

  • What are the rules for valid pp-tokens?

    These are spelled out in the respective standards; C11 §6.4 and C++11 §2.5 [lex.pptoken]. In both cases, they correspond to the production preprocessing-token. Aside from pp-number, they shouldn't be too surprising. The remaining possibilities are header-names (recognised only in contexts such as #include directives), identifiers (including keywords), "punctuators" (in C++, preprocessing-op-or-punc), string and character literals, and any single non-whitespace character which doesn't match any other production.

    With a few exceptions, any sequence of characters can be decomposed into a sequence of valid preprocessing-tokens. (One exception is unmatched quotes and apostrophes: a single quote or apostrophe is not a valid preprocessing-token, so a text including an unterminated string or character literal cannot be tokenised.)

    In the context of the ## operator, though, the result of the concatenation must be a single preprocessing-token. So an invalid concatenation is a concatenation whose result is a sequence of characters which comprises multiple preprocessing-tokens.
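
As a minimal sketch of the "single preprocessing-token" rule, the following C fragment pastes pairs of punctuators whose combined spelling is itself one punctuator. The names PASTE, shift_left, add_in_place and check_valid_pastes are illustrative assumptions, not part of the question.

```c
#include <assert.h>

/* PASTE is a hypothetical helper name, not taken from the question. */
#define PASTE(a, b) a ## b

/* '<' ## '<' yields the single punctuator '<<'. */
int shift_left(int value, int count)
{
    return value PASTE(<, <) count;
}

/* '+' ## '=' yields the single punctuator '+='. */
int add_in_place(int total, int delta)
{
    total PASTE(+, =) delta;
    return total;
}

/* A paste such as PASTE(-, 7) would spell "-7", which is two tokens
   ('-' and '7'). That is undefined behavior, so it is only mentioned
   here in a comment and never compiled. */

void check_valid_pastes(void)
{
    assert(shift_left(1, 3) == 8);
    assert(add_in_place(10, 5) == 15);
}
```

Both pastes are valid only because << and += exist as punctuators; the same macro applied to an arbitrary pair of tokens gives no such guarantee.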

  • Are there differences between C and C++?

    Yes, there are slight differences:

    • C++ has user defined string and character literals, and allows "raw" string literals. These literals will be tokenized differently in C, so they might be multiple preprocessing-tokens or (in the case of raw string literals) even invalid preprocessing-tokens.

    • C++ includes the symbols ::, .* and ->*, all of which would be tokenised as two punctuator tokens in C. Also, in C++, some things which look like keywords (eg. new, delete) are part of preprocessing-op-or-punc (although these symbols are valid preprocessing-tokens in both languages.)

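    • C (since C99) allows hexadecimal floating-point literals (e.g. 0x1.1p-3). Before C++17, the p- in such a literal does not continue a pp-number in C++, so the literal is not a single preprocessing-token there.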

    • C++ (since C++14) allows apostrophes to be used in integer literals as separators (1'000'000'000). In C, this would probably result in unmatched apostrophes.

    • There are minor differences in the handling of universal character names (eg. \u0234).

    • In C++, <:: will be tokenised as <, :: unless it is followed by : or >. (<::: and <::> are tokenised normally, using the longest-match rule.) In C, there are no exceptions to the longest-match rule; <:: is always tokenised using the longest-match rule, so the first token will always be <:.

  • Is it legal to concatenate test_, 0001, and _THING, even though concatenation order is unspecified?

    Yes, that is legal in both languages.

        test_ ## 0001 => test_0001             (identifier)
        test_0001 ## _THING => test_0001_THING (identifier)
    
        0001 ## _THING => 0001_THING           (pp-number)
        test_ ## 0001_THING => test_0001_THING (identifier)
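
To see the result in a compilable form, here is a small sketch built around the macro from the question. STR_, STR, pasted_name and check_pasted_name are hypothetical helpers added only to make the pasted spelling observable.

```c
#include <assert.h>
#include <string.h>

/* The macro from the question. */
#define test(A) test_ ## A ## _THING

/* Two-level stringification (hypothetical helpers): the extra level
   lets the argument be macro-expanded before # is applied to it. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* Declares: int test_0001_THING = 2; */
int test(0001) = 2;

/* Returns the spelling of the identifier produced by the paste. */
const char *pasted_name(void)
{
    return STR(test(0001));
}

void check_pasted_name(void)
{
    assert(test_0001_THING == 2);
    assert(strcmp(pasted_name(), "test_0001_THING") == 0);
}
```

Either pasting order passes through valid intermediate tokens (the identifier test_0001 or the pp-number 0001_THING), so the final identifier is the same in both cases.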
    

  • What are examples of invalid token concatenation?

    Suppose we have

    #define concat3(a, b, c) a ## b ## c
    

    Now, the following are invalid regardless of concatenation order:

    concat3(., ., .)
    

    .. is not a token even though ... is. But the concatenation must proceed in some order, and .. would be a necessary intermediate value; since that is not a single token, the concatenation would be invalid.

    concat3(27,e,-7)
    

    Here, -7 is two tokens, so it cannot be concatenated.

    And here is a case in which concatenation order matters:

    concat3(27e, -, 7)
    

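    If this is concatenated left-to-right, the first step produces 27e- (a pp-number), and 27e- ## 7 is then the concatenation of two pp-numbers, giving the pp-number 27e-7. If it is concatenated right-to-left, however, the first step is - ## 7, which is invalid because (as above) -7 is not a single token.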
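
One way to avoid relying on the unspecified order is to force it with an extra macro layer, so that the inner paste has finished before the outer one starts. This is only a sketch; PASTE, PASTE_EXPANDED, forced_left_to_right and check_forced_order are assumed names, not something from the question.

```c
#include <assert.h>

#define PASTE(a, b) a ## b

/* The parameters here are not operands of ##, so each argument is
   fully macro-expanded before it is handed on to PASTE. */
#define PASTE_EXPANDED(a, b) PASTE(a, b)

/* Inner paste runs first:  27e  ## -  =>  27e-   (a pp-number)
   Outer paste runs second: 27e- ## 7  =>  27e-7  (a pp-number that
   is also a valid floating constant) */
double forced_left_to_right(void)
{
    return PASTE_EXPANDED(PASTE(27e, -), 7);
}

void check_forced_order(void)
{
    assert(forced_left_to_right() == 27e-7);
}
```

Because every intermediate result is a single pp-number, this version does not depend on how the implementation orders multiple ## operators within one replacement list.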

  • What exactly is a pp-number?

    In general terms, pp-number is a superset of the numeric literals: a pp-number token might later be converted into a (single) numeric literal, or it might turn out to be invalid. The definition is intentionally broad, partly in order to allow (some) token concatenations, and partly to insulate the preprocessor from the periodic changes in numeric formats. The precise definition can be found in the respective standards, but informally a token is a pp-number if:

    • It starts with a decimal digit or a period (.) followed by a decimal digit.
    • The rest of the token is letters, digits and periods, possibly including sign characters (+, -) if immediately preceded by an exponent symbol. The exponent symbol can be E or e in both languages, and also P or p in C since C99 (and in C++ since C++17).
    • In C++, a pp-number can also include (but not start with) an apostrophe followed by a letter or digit.
    • Note: Above, letter includes underscore. Also, universal character names can be used (except following an apostrophe in C++).
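
The informal rules above can be sketched as a small C classifier. This is an approximation under stated assumptions: it follows the C99 rules only, ignores universal character names and the C++ apostrophe rule, and the names is_pp_number and check_pp_number_examples are made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if s is spelled like a single C99 pp-number, else 0.
   Simplification: universal character names are not handled. */
int is_pp_number(const char *s)
{
    size_t i = 0;

    /* Must start with a digit, or with '.' followed by a digit. */
    if (s[i] == '.')
        i++;
    if (!isdigit((unsigned char)s[i]))
        return 0;
    i++;

    for (; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];

        /* Letters, digits, underscore and period are always allowed. */
        if (isalnum(c) || c == '_' || c == '.')
            continue;

        /* A sign is allowed only directly after an exponent letter. */
        if ((c == '+' || c == '-') &&
            (s[i - 1] == 'e' || s[i - 1] == 'E' ||
             s[i - 1] == 'p' || s[i - 1] == 'P'))
            continue;

        return 0;
    }
    return 1;
}

void check_pp_number_examples(void)
{
    assert(is_pp_number("0001_THING"));  /* pp-number, not a valid literal */
    assert(is_pp_number("27e-"));        /* sign directly after e */
    assert(is_pp_number("0x1.1p-3"));    /* hexadecimal floating literal */
    assert(!is_pp_number("-7"));         /* '-' and '7' are two tokens */
    assert(!is_pp_number(".."));         /* '.' must be followed by a digit */
}
```

Note that the classifier accepts spellings such as 0001_THING and 27e-, which are pp-numbers but not valid numeric literals; that distinction only matters after preprocessing.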

    Once preprocessing is complete, all pp-numbers will be converted to numeric literals if possible. If the conversion is not possible (because the token doesn't correspond to the syntax for any numeric literal), the program is invalid.