Search code examples
cconcatenationc-preprocessortoken-pasting-operator

Why does the preprocessor forbid macro pasting of weird tokens?


I am writing my own C-preprocessor based on GCC. So far it is nearly identical, but what I consider redundant is to perform any form of checking on the tokens being concatenated by virtue of ##. So in my preprocessor manual, I've written this:

3.5 Concatenation

...

GCC forbids concatenation with two mutually incompatible preprocessing tokens such as "x" and "+" (in any order). That would result in the following error: "pasting "x" and "+" does not give a valid preprocessing token" However this isn't true for this preprocessor - concatenation may occur between any token.

My reasoning is simple: if it expands to invalid code, then the compiler will produce an error and so I don't have to explicitly handle such cases, making the preprocessor slower and increasing in code complexity. If it results in valid code, then this restriction removal just makes it more flexible (although probably in rare cases).

So I would like to ask, why does this error actually happen, why is this restriction actually applied and is it a true crime if I dismiss it in my preprocessor?


Solution

  • As far as ISO C goes, if ## creates an invalid token, the behavior is undefined. But there is a somewhat strange situation there, in the following way. The output of the preprocessing translation phases of C is a stream of preprocessing tokens (pp-tokens). These are converted to tokens, and then syntactically and semantically analyzed. Now here is an important rule: it is a constraint violation if a pp-token doesn't have a form which lets it be converted to a token. Thus, a preprocessor token which is garbage that you write yourself without help from the ## operator must be diagnosed for bad lexical syntax. But if you use ## to create a bad preprocessing token, the behavior is undefined.

    Note the subtlety there: the behavior of ## is undefined if it is used to create a bad preprocessing token. It's not the case that the pasting is well-defined, and then caught at the stage where pp-tokens are converted to tokens: it's undefined right from that point where ## is evaluated.

    Basically, this is historical. C preprocessors were historically (and probably some are) separate programs, with lexical analysis that was different from and looser from the downstream compiler. The C standard tried to capture that somehow in terms of a single language with translation phases, and the result has some quirks and areas of perhaps surprising under-specification. (For instance in the preprocessing translation phases, a number token ("pp-number") is a strange lexical grammar which allows gibberish, such as tokens with multiple floating-point E exponents.)

    Now, back to your situation. Your textual C preprocessor does not actually output pp-token objects; it outputs another text stream. You may have pp-token objects internally, but they get flattened on output. Thus, you might think, why not allow your ## operator to just blindly glue together any two tokens? The net effect is as if those tokens were dumped into the output stream without any intervening whitespace. (And this is probably all it was, in early preprocessors which supported ##, and ran as separate programs).

    Unfortunately what that means is that your ## operator is not purely a bona fide token pasting operator; it's just a blind juxtaposing operator which sometimes produces one token, when it happens to juxtapose two tokens that will be lexically analyzed as one by the downstream compiler. If you go that way, it may be best to be honest and document it as such, rather than portraying it as a flexibility feature.

    A good reason, on the other hand, to reject bad preprocessing tokens in the ## operator is to catch situations in which it cannot achieve its documented job description: the requirement of making one token out of two. The idea is that the programmer knows the language spec (the contract between the programmer and the implementation) and knows that ## is supposed to make one token, and relies on that. For such a programmer, any situation involving a bad token paste is a mistake, and that programmer is best supported by diagnosis.

    The maintainers of GCC and the GNU CPP preprocessor probably took this view: that the preprocessor isn't a flexible text munging tool, but part of a toolchain supporting disciplined C programming.

    Moreover, the undefined behavior of a bad token paste job is easily diagnosed, so why not diagnose it? The lack of a diagnosis requirement in this area in the standard looks like just a historic concession. It is a kind of "low-hanging fruit" of diagnosis. Let those undefined behaviors go undiagnosed for which diagnosis is difficult or intractable, or requires run-time penalties.