
Why can't one omit white space after primitive data types?


Why don't compilers let me write intmain instead of int main? Don't compilers discard white space at compile time?


Solution

  • Most tokenizers discard white space in the sense that they don't generate a token for it. But that doesn't mean that white space has no effect at all: it still forces one token to end and the next to begin.

    One way to think of it is that a tokenizer has three jobs:

    1. Split the input text into lexemes (substrings).
    2. Classify each lexeme into a category like "identifier" or "white space".
    3. Generate a stream of token values that contain the value and the classification of each lexeme.

    Jobs 1 and 2 work by defining a rule for each type of token. At each position, the tokenizer picks the rule that matches the longest prefix of the remaining input (often called "maximal munch"). It then emits that prefix as the lexeme and classifies it according to the rule that matched.

    So what we mean when we say that tokenizers "ignore" white space is that job 3 doesn't generate a token for lexemes that have been classified as white space. This does not in any way affect job 1.
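The three jobs can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular compiler is implemented: the rule table, the function names split_and_classify and tokenize, and the choice to break ties in favor of the earlier rule are all assumptions made for this example.

```python
import re

# Hypothetical rule set for illustration; the category names mirror the ones
# used in this answer. Order matters only for breaking ties.
RULES = [
    ("KEYWORD_INT", re.compile(r"int")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OPEN_PAR", re.compile(r"\(")),
    ("CLOSING_PAR", re.compile(r"\)")),
    ("SPACE", re.compile(r"\s+")),
]


def split_and_classify(text):
    """Jobs 1 and 2: return (lexeme, category) pairs using the longest match."""
    pos = 0
    result = []
    while pos < len(text):
        best_kind, best_len = None, 0
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' lets the earlier rule win ties, so a lone "int"
            # is classified as the keyword rather than as an identifier.
            if m and m.end() - pos > best_len:
                best_kind, best_len = kind, m.end() - pos
        if best_kind is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        result.append((text[pos:pos + best_len], best_kind))
        pos += best_len
    return result


def tokenize(text):
    """Job 3: drop white space; keep the lexeme only for identifiers."""
    tokens = []
    for lexeme, kind in split_and_classify(text):
        if kind == "SPACE":
            continue  # white space ended the previous lexeme, but yields no token
        tokens.append((kind, lexeme) if kind == "IDENTIFIER" else (kind,))
    return tokens
```

With these assumed rules, tokenize("int main()") yields a KEYWORD_INT token followed by an identifier token for "main", while tokenize("intmain()") starts with a single identifier token for "intmain", because the identifier rule matches seven characters and the keyword rule only three.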

    To illustrate this, the three jobs for the input int main() would look as follows:

    1. "int", " ", "main", "(", ")"
    2. KEYWORD_INT, SPACE, IDENTIFIER, OPEN_PAR, CLOSING_PAR
    3. KEYWORD_INT, IDENTIFIER("main"), OPEN_PAR, CLOSING_PAR

    And for intmain() it would look like this:

    1. "intmain", "(", ")"
    2. IDENTIFIER, OPEN_PAR, CLOSING_PAR
    3. IDENTIFIER("intmain"), OPEN_PAR, CLOSING_PAR

    We get the single lexeme intmain rather than int followed by main because the identifier rule keeps matching after the t, so it produces a longer match than the keyword rule does.

    In a comment on another answer, you asked why int doesn't get priority. That would mean changing the logic of job 1 so that it no longer takes the rule with the longest match, but instead always prefers the "keyword int" rule over the "identifier" rule. It would then be impossible to ever have an identifier that starts with "int", such as integer or interval, because its first three characters would always be classified as the keyword int. So that would be a really bad idea.
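To see the problem concretely, here is a small sketch of that alternative design, in which the first rule that matches wins regardless of match length. The rule table and the function name lexemes_keyword_first are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical rules in priority order: the keyword rule is tried first.
PRIORITY_RULES = [
    ("KEYWORD_INT", re.compile(r"int")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("SPACE", re.compile(r"\s+")),
]


def lexemes_keyword_first(text):
    """Split text by taking the first rule that matches, ignoring match length."""
    pos = 0
    out = []
    while pos < len(text):
        for kind, pattern in PRIORITY_RULES:
            m = pattern.match(text, pos)
            if m:
                out.append((m.group(), kind))
                pos = m.end()
                break
        else:
            # No rule matched at this position.
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return out
```

Under this scheme, lexemes_keyword_first("interval") splits the name into the keyword "int" and an identifier "erval", so a variable called interval could never be declared. A name such as "print" is unaffected, because the keyword rule only gets a chance to match at the start of a lexeme.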