Tags: c, parsing, compiler-construction, flex-lexer

When are spaces (or parentheses) required in C during compilation?


I am learning how compilation works, and my final goal is to write a mini C compiler. I am still at the beginning of this project. While working on the scanner and parser that build the AST, I realized that a space is (or parentheses are) required in expressions such as i+ +4, i+(+4), i- -4, or i-(-4). Otherwise, in the expression i--4 (for example), -- is interpreted as the unary operator -- and an error is raised. I understand the reason perfectly. That is not the question.
The question is the following: I used to think, naively, that spaces in C mattered only for code readability. But now I wonder whether there are other examples like those described above.


Solution

  • The rules for when spaces are needed in C are not specified explicitly but are consequences of how C is parsed. The rules for this are fairly complicated, as they involve multiple phases of analysis and some exceptions for various situations. If you are writing a C compiler, you need to be using the C standard as a reference.

    C 2018 5.1.1.2 specifies translation phases (rephrasing and summarizing, not exact quotes):

    1. Physical source file multibyte characters are mapped to the source character set. Trigraph sequences are replaced by single-character representations.

    2. Lines continued with backslashes are merged.

    3. The source file is converted from characters into preprocessing tokens and white-space characters—each sequence of characters that can be a preprocessing token is converted to a preprocessing token, and each comment becomes one space.

    4. Preprocessing is performed (directives are executed and macros are expanded).

    5. Source characters in character constants and string literals are converted to members of the execution character set.

    6. Adjacent string literals are concatenated.

    7. White-space characters are discarded. “Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.” (That quoted text is the main part of C compilation as we think of it!)

    8. The program is linked to become an executable file.
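
    Some of these phases can be observed directly in ordinary code. The following is a small sketch, not taken from the standard: the names total, separated, joined, and check_phases are made up for illustration, and the comments describe what phases 2, 3, and 6 do to each declaration.

```c
#include <assert.h>
#include <string.h>

/* Phase 2: the backslash-newline pair is deleted, so these two physical
   lines form the single identifier `total`. */
static int to\
tal = 3;

/* Phase 3: the comment becomes one space, so `int` and `separated`
   are two preprocessing tokens rather than one identifier. */
static int/**/separated = 4;

/* Phase 6: adjacent string literals are concatenated into "abcd". */
static const char joined[] = "ab" "cd";

/* Hypothetical helper that checks the three observations above. */
int check_phases(void) {
    assert(total == 3);
    assert(separated == 4);
    assert(strcmp(joined, "abcd") == 0);
    return total + separated + (int)strlen(joined);
}
```

    Calling check_phases() from a main function should pass all three assertions, which suggests the compiler really did splice the line, replace the comment with a space, and join the literals before parsing.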

    Primarily, where spaces are needed in C source code is governed by phase 3, the formation of preprocessing tokens. This is specified in C 2018 6.4. A grammar for preprocessing tokens is given in paragraph 1 (more on this below), and paragraph 4 tells us:

    If the input stream has been parsed into preprocessing tokens up to a given character, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token. There is one exception to this rule: header name preprocessing tokens are recognized only within #include preprocessing directives and in implementation-defined locations within #pragma directives. In such contexts, a sequence of characters that could be either a header name or a string literal is recognized as the former.

    Paragraph 1 tells us a preprocessing token is one of a header-name, identifier, pp-number, character-constant, string-literal, punctuator, or a non-white-space character that is not one of the preceding items.

    Then further subclauses in 6.4 tell us what those tokens look like.
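
    For someone writing a scanner, the longest-match rule of paragraph 4 can be sketched roughly as follows. This is a simplified illustration, not a conforming lexer: it assumes a tiny hypothetical subset of punctuators, treats numbers as plain digit runs instead of full pp-numbers, and ignores header names, literals, and comments. The names PUNCTUATORS, next_token_len, and split_tokens are invented for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of C punctuators, listed longest first so that the
   first match found is also the longest match. */
static const char *const PUNCTUATORS[] = {
    "++", "--", "+=", "-=", "->", "+", "-", "(", ")"
};

/* Length of the longest token starting at s. The caller is expected to have
   skipped white space. Returns 0 only at the end of the string. */
size_t next_token_len(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        while (isalnum((unsigned char)s[n]) || s[n] == '_') n++;
        return n;                       /* identifier */
    }
    if (isdigit((unsigned char)s[0])) {
        while (isdigit((unsigned char)s[n])) n++;
        return n;                       /* simplified number: digits only */
    }
    for (size_t i = 0; i < sizeof PUNCTUATORS / sizeof PUNCTUATORS[0]; i++) {
        size_t len = strlen(PUNCTUATORS[i]);
        if (strncmp(s, PUNCTUATORS[i], len) == 0) return len;
    }
    return s[0] != '\0';                /* any other character stands alone */
}

/* Writes the tokens of src into out, separated by '|'. Stops early instead
   of overflowing if out is too small. */
void split_tokens(const char *src, char *out, size_t out_size) {
    size_t used = 0;
    assert(src != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    while (*src != '\0') {
        if (isspace((unsigned char)*src)) { src++; continue; }
        size_t len = next_token_len(src);
        if (used + len + 2 > out_size) break;   /* separator + token + NUL */
        if (used > 0) out[used++] = '|';
        memcpy(out + used, src, len);
        used += len;
        out[used] = '\0';
        src += len;
    }
}
```

    With this sketch, split_tokens on "i--4" yields i|--|4, while "i- -4" yields i|-|-|4, which is the behavior described in the question.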

    Phase 3 essentially induces two rules for where you need a space:

    • If the source code would be parsed, according to the above rules, as one preprocessing token where you want two, then you must insert a space where you want the first token to end.
    • If you want / followed by * to act as two separate operators (for example, dividing by a dereferenced pointer) rather than as /* introducing a comment, put a space between them.
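
    A few concrete cases of these two rules, written as small functions. The function names are made up for this illustration, and the list is meant as a sample of cases, not an exhaustive one. In each function the space shown is required to get the meaning described in the comment.

```c
#include <assert.h>
#include <stddef.h>

/* `i- -4` is i - (-4); written `i--4` it would lex as `i`, `--`, `4`. */
int minus_negative(int i) { return i- -4; }

/* `a+++b` lexes as `a`, `++`, `+`, `b`, that is (a++) + b. */
int post_increment_sum(int a, int b) { return a+++b; }

/* With a space the tokens are `a`, `+`, `++`, `b`, that is a + (++b). */
int pre_increment_sum(int a, int b) { return a+ ++b; }

/* The space keeps the slash and the star apart; written together they
   would open a comment instead of dividing by the pointee. */
int divide_by_pointee(int x, const int *p) {
    assert(p != NULL && *p != 0);
    return x/ *p;
}

/* `0xe +1` is 14 + 1. Without the space, `0xe+1` is a single pp-number
   (the `e+` is absorbed as if it were an exponent) and is not a valid
   constant. */
int hex_plus_one(void) { return 0xe +1; }
```

    The last case tends to surprise people: the pp-number grammar is deliberately broader than the grammar for valid numeric constants, so the longest-match rule can swallow a + or - that was meant as an operator.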

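    Phase 4 induces another rule. C 2018 6.10.3 paragraph 3 says “There shall be white space between the identifier and the replacement list in the definition of an object-like macro,” and a ( that immediately follows the macro name starts the parameter list of a function-like macro. So you need a space after the name to define an object-like macro whose replacement list begins with a parenthesis: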

    #define foo(x) (3*(x)) // Macro that acts on argument x.
    #define foo (x)        // Macro that expands to `(x)`.
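
    To see the difference in use, here is a compilable sketch with two differently named macros, since defining foo both ways in one file would be a conflicting redefinition. TRIPLE, PAREN_X, and the functions around them are hypothetical names chosen for this example.

```c
#include <assert.h>

/* No space before '(': function-like macro with parameter x. */
#define TRIPLE(x) (3*(x))

/* Space before '(': object-like macro whose replacement list is `(x)`. */
#define PAREN_X (x)

/* TRIPLE(v + 1) expands to (3*(v + 1)). */
int triple_next(int v) { return TRIPLE(v + 1); }

/* PAREN_X expands to `(x)`, which here names the function parameter x. */
int identity_via_macro(int x) { return PAREN_X; }

/* Hypothetical helper that checks both expansions. */
void check_macros(void) {
    assert(triple_next(1) == 6);
    assert(identity_via_macro(7) == 7);
}
```

    Note that PAREN_X only compiles where some x is in scope, because the preprocessor pastes the text (x) without knowing anything about variables.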