Why don't compilers let me use intmain instead of int main? Don't compilers discard white space at compile time?
Most tokenizers discard white space in the sense that they don't generate a token for it. But that doesn't mean that white space has no effect at all: it still forces one token to end and the next to begin.
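One way to think of it is that a tokenizer has three jobs:
1. Split the input into lexemes, i.e. decide where each token starts and ends.
2. Classify each lexeme, i.e. decide which type of token it is.
3. Generate a token for each classified lexeme.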
The way that jobs 1 and 2 work is that we define rules for each type of token. The tokenizer then takes the rule which matches the longest prefix of the current input. It then generates the lexeme and classifies it based on which rule was used to match it.
So what we mean when we say that tokenizers "ignore" white space is that job 3 doesn't generate a token for lexemes that have been classified as white space. This does not in any way affect job 1.
To illustrate this, the three jobs for the input int main() would look as follows:
Job 1: "int", " ", "main", "(", ")"
Job 2: KEYWORD_INT, SPACE, IDENTIFIER, OPEN_PAR, CLOSING_PAR
Job 3: KEYWORD_INT, IDENTIFIER("main"), OPEN_PAR, CLOSING_PAR
And for intmain() it would look like this:
Job 1: "intmain", "(", ")"
Job 2: IDENTIFIER, OPEN_PAR, CLOSING_PAR
Job 3: IDENTIFIER("intmain"), OPEN_PAR, CLOSING_PAR
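The three jobs can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular compiler is implemented: the rule set, the names RULES, split_and_classify and tokenize, and the choice to break ties in favour of the earlier rule are all assumptions made for this example.

```python
import re

# Hypothetical rule set: (token type, pattern). Order only matters for ties.
RULES = [
    ("KEYWORD_INT", re.compile(r"int")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("SPACE", re.compile(r"\s+")),
    ("OPEN_PAR", re.compile(r"\(")),
    ("CLOSING_PAR", re.compile(r"\)")),
]

def split_and_classify(source):
    """Jobs 1 and 2: cut the input into lexemes and classify each one."""
    pos = 0
    result = []
    while pos < len(source):
        best_kind, best_len = None, 0
        for kind, pattern in RULES:
            m = pattern.match(source, pos)
            # A strictly longer match wins; on a tie the earlier rule stays.
            if m and m.end() - pos > best_len:
                best_kind, best_len = kind, m.end() - pos
        if best_kind is None:
            raise ValueError(f"unexpected character at position {pos}")
        result.append((best_kind, source[pos:pos + best_len]))
        pos += best_len
    return result

def tokenize(source):
    """Job 3: emit tokens, skipping lexemes classified as white space."""
    return [(kind, lexeme)
            for kind, lexeme in split_and_classify(source)
            if kind != "SPACE"]

print(tokenize("int main()"))
print(tokenize("intmain()"))
```

Note that the SPACE lexeme is dropped only in tokenize; by that point split_and_classify has already used it to end the int lexeme, which is why removing the space changes the result.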
The reason that we get intmain and not int, main is that the identifier rule keeps matching after the t, so it produces a longer match than the keyword rule, which stops after int.
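To make the comparison concrete, here is a small sketch that measures how far each of the two competing rules matches at the start of the input. The two patterns and the helper name match_lengths are assumptions for illustration only.

```python
import re

# Hypothetical patterns for the two competing rules.
KEYWORD_INT = re.compile(r"int")
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def match_lengths(source):
    """Return how many characters each rule matches at the start of source."""
    lengths = {}
    for name, pattern in (("KEYWORD_INT", KEYWORD_INT), ("IDENTIFIER", IDENTIFIER)):
        m = pattern.match(source)
        lengths[name] = len(m.group()) if m else 0
    return lengths

print(match_lengths("intmain()"))   # {'KEYWORD_INT': 3, 'IDENTIFIER': 7}
print(match_lengths("int main()"))  # {'KEYWORD_INT': 3, 'IDENTIFIER': 3}
```

For intmain() the identifier rule wins outright with 7 characters against 3. For int main() both rules stop at the space, and a tokenizer needs some tie-breaking convention (commonly "keywords win on equal length") to classify the lexeme as the keyword.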
In a comment on another answer you asked why int doesn't get priority. That would mean changing the logic of job 1 so that it no longer takes the rule with the longest match, but instead always prefers the "keyword int" rule over the "identifier" rule. It would then be impossible to ever have an identifier that starts with "int", because its first three characters would always be split off as the keyword int (a variable named integer, for example, would become int followed by eger). So that'd be a really bad idea.
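The difference between the two strategies can be sketched as follows. Both functions are hypothetical and only look at the first token. The second one matches the identifier rule and then checks whether the whole lexeme is exactly int, which for these two rules gives the same result as longest match with a keyword tie-break.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def first_token_keyword_priority(source):
    """Hypothetical variant: always prefer the keyword rule when it matches."""
    if source.startswith("int"):
        return ("KEYWORD_INT", "int")
    m = IDENTIFIER.match(source)
    return ("IDENTIFIER", m.group()) if m else None

def first_token_longest_match(source):
    """The keyword only wins if the identifier rule stops at the same place."""
    m = IDENTIFIER.match(source)
    if m is None:
        return None
    lexeme = m.group()
    return ("KEYWORD_INT", lexeme) if lexeme == "int" else ("IDENTIFIER", lexeme)

print(first_token_keyword_priority("interval = 5"))  # ('KEYWORD_INT', 'int')
print(first_token_longest_match("interval = 5"))     # ('IDENTIFIER', 'interval')
```

With keyword priority, a name such as interval can never come out as a single identifier, which is the problem described above.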