Clarification regarding lexical errors in C


I have already read this and this question. They are quite helpful, but I still have some doubts regarding token generation in a lexical analyzer for C. What if the lexical analyzer encounters int a2.5c;? According to my understanding, 7 tokens will be generated:

int keyword
a identifier
2 constant
. special symbol
5 constant
c identifier
; special symbol

So the lexical analyzer will not report any error, and the tokens will be generated successfully.

Is my understanding correct? If not, can you please help me understand?

Also, if we declare a constant as double a = 10.10.10;, will it generate any lexical errors? Why?

UPDATE: Asking out of curiosity, what if the lexical analyzer encounters something like a :-) smiley in the program? Will it generate a lexical error? As per my understanding, : will be treated as a special symbol, - as an operator, and ) as a special symbol.
Thank you.


Solution

  • Your first list of tokens is almost correct -- a2 is a valid identifier.

    It's true that the first example won't generate any "lexical" errors per se, although there will be a parse error at the . token.

    It's hard to say whether the error in your second example is a lexical error or a parse error. The lexical structure of a floating-point constant is pretty complicated. I can imagine a compiler that grabs a string of digits and . and e/E and doesn't notice until it calls the equivalent of strtod that there are two decimal points, meaning that it might report a "lexical error". Read purely as floating-point constants, though, what we have there is two of them in a row -- 10.10 and .10 -- meaning that it's more likely a "parse error". (The C standard's preprocessing-number rule actually sides with the first picture: 10.10.10 is collected as a single preprocessing number, which then fails to convert to a valid constant.)
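
The "grab everything, then convert" strategy described above can be sketched in a few lines of C. This is only an illustration, not how any particular compiler works: scan_number is a hypothetical helper, the 64-byte buffer is an arbitrary assumption, and signed exponents and suffixes are ignored. The standard strtod call is used purely to find out how much of the collected run forms one number.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not taken from any real compiler.
   Greedily collects digits, '.', 'e' and 'E' starting at src, then asks
   strtod() to convert the collected run. Returns the length of the run and
   sets *ok to 1 if strtod() consumed all of it, 0 otherwise.
   Simplifications: no exponent signs, no suffixes, runs over 63 chars are cut. */
size_t scan_number(const char *src, int *ok)
{
    char buf[64];
    size_t n = 0;

    assert(src != NULL && ok != NULL);

    /* Step 1: grab the whole run without checking its structure. */
    while (n < sizeof buf - 1 &&
           (isdigit((unsigned char)src[n]) || src[n] == '.' ||
            src[n] == 'e' || src[n] == 'E'))
        n++;
    memcpy(buf, src, n);
    buf[n] = '\0';

    /* Step 2: convert; strtod() stops at the first character it cannot use. */
    char *end;
    (void)strtod(buf, &end);
    *ok = (n > 0 && end == buf + n);
    return n;
}
```

With the input 10.10.10; the run is the eight characters 10.10.10, but strtod stops after 10.10, so ok comes back as 0. A lexer built this way would report the problem itself, before the parser ever sees a token.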

    In the end, though, these are all just errors. Unless you're taking a compiler design/construction class, I'm not sure how important it is to classify errors as lexical or otherwise.


    Addressing your follow-on question, yes, :-) would lex as three tokens :, -, and ).

    Because just about any punctuation character is legal in C, there are relatively few character sequences that are lexically illegal (that is, that would generate errors during the lexical analysis phase). In fact, the only ones I can think of are:

    • Illegal characters (I think the only unused printable ASCII ones are `, @, and $, although many compilers accept $ in identifiers as an extension)
    • Various problems with character and string constants (missing ' or ", bad escape sequences, etc.)

    Indeed, almost any string of punctuation you care to bang out will make it through a C lexical analyzer, although of course it may or may not parse. (A somewhat infamous example is a+++++b, which, because the lexer always takes the longest token it can, unfortunately lexes as a++ ++ + b and is therefore a syntax error.)
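
To make the longest-match behaviour concrete, here is a minimal sketch of a tokenizer for identifiers and a handful of punctuators. It is an assumption-laden toy: next_token and the PUNCT table are hypothetical names, the punctuator list is deliberately incomplete, numbers are not handled, and any unrecognised character is simply passed through as a one-character token.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* A few C punctuators, longer ones first so the first match is the longest.
   The list is incomplete on purpose; it only serves the examples here. */
static const char *const PUNCT[] = {
    "++", "--", "->", "+", "-", ":", "(", ")", ";", "."
};

/* Hypothetical helper: copies the next token of *src into tok (capacity cap)
   and advances *src past it. Returns 1 if a token was produced, 0 at end of input. */
int next_token(const char **src, char *tok, size_t cap)
{
    const char *s = *src;
    size_t n = 0;

    assert(cap >= 2);
    while (isspace((unsigned char)*s))
        s++;
    if (*s == '\0') {
        *src = s;
        return 0;
    }

    if (isalpha((unsigned char)*s) || *s == '_') {
        /* Identifier: a letter or underscore, then letters, digits, underscores. */
        while (isalnum((unsigned char)s[n]) || s[n] == '_')
            n++;
    } else {
        n = 1;  /* Fallback: an unknown character becomes a one-character token. */
        for (size_t i = 0; i < sizeof PUNCT / sizeof PUNCT[0]; i++) {
            size_t len = strlen(PUNCT[i]);
            if (strncmp(s, PUNCT[i], len) == 0) {
                n = len;    /* Longest match wins because of the table order. */
                break;
            }
        }
    }

    if (n > cap - 1)
        n = cap - 1;        /* Clamp to the caller's buffer. */
    memcpy(tok, s, n);
    tok[n] = '\0';
    *src = s + n;
    return 1;
}
```

Calling next_token repeatedly on a+++++b produces a, ++, ++, +, b, and on :-) it produces :, -, ). In neither case does the tokenizer complain; whether the resulting token sequence means anything is left entirely to the parser.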