compiler-construction lexical-analysis

Confusion in lexical analysis


Take the state diagram for identifiers in lexical analysis. Basically, it says to return the token as an (identifier, attribute) pair whenever the analyzer reads any character other than a letter or a digit. So, according to this rule, while reading the string dtf56*f%%f, will the tokens generated be the following?

dtf56: Identifier

f: Identifier

f: Identifier

My assumption is that the lexical analyzer should throw an error in this case, since this is a single string. As a general question, at which "other" characters should a lexeme be returned?

State diagram for identifiers


Solution

  • If the asterisk and percent sign are legal characters it should return them too, separately. The point of your first sentence is that analysis should stop and return the token accumulated so far when a character that can't be part of it is encountered.

    What I am confused about is when I should return a lexeme. For example, for the string 56fdt, should I return 56 as an integer and fdt as an identifier? Or should I throw an error?

    According to your state diagram you should return them separately. An identifier can only start with a letter. That's the meaning of the notation.

    You should only 'throw an error' if you encounter a character that isn't part of the alphabet of the language you're scanning. With practical tools such as flex(1), it is actually better to return those characters to the parser as well (assuming yacc(1) and friends), so that the parser's error-recovery rules can take effect, rather than just printing a possibly lengthy string of 'illegal character' errors.

    So, the bottom line is to follow the state transition diagrams without question? (I feel like a knucklehead asking this.)

    The state diagram says if you find a letter in state 9, transition to state 10, and stay in it while you have more letters or digits, then, when that stops, output the accumulated token as ID. You should certainly follow the state diagram without question if it is correct for the language you're analyzing. [There are languages in which 56fdt is a legal identifier, but in that case the state diagram would be different, very different.]
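
To make the behaviour discussed above concrete, here is a minimal hand-written scanner sketch in Python. It is not taken from the question's diagram or from any tool: the token names (ID, NUM, OP, UNKNOWN), the choice of `*` and `%` as the legal operators, and the `tokenize` function itself are assumptions made for illustration. The identifier branch mirrors the "letter, then letters or digits, then stop at any other character" rule, and the stopping character is left in the input for the next token.

```python
import string

LETTERS = string.ascii_letters
DIGITS = string.digits
OPERATORS = "*%"  # assumption: the only operator characters in this toy language


def tokenize(text):
    """Scan left to right, always emitting the longest lexeme that matches."""
    tokens = []
    i = 0
    n = len(text)
    while i < n:
        start = i
        ch = text[i]
        if ch.isspace():
            i += 1  # skip whitespace between tokens
        elif ch in LETTERS:
            # A letter starts an identifier; keep going on letters or digits.
            i += 1
            while i < n and (text[i] in LETTERS or text[i] in DIGITS):
                i += 1
            # The "other" character at text[i] is NOT consumed here.
            tokens.append(("ID", text[start:i]))
        elif ch in DIGITS:
            i += 1
            while i < n and text[i] in DIGITS:
                i += 1
            tokens.append(("NUM", text[start:i]))
        elif ch in OPERATORS:
            i += 1
            tokens.append(("OP", ch))
        else:
            # Not in the alphabet: hand it on as its own token instead of
            # aborting, so a later stage (e.g. the parser) can report it.
            i += 1
            tokens.append(("UNKNOWN", ch))
    return tokens


print(tokenize("dtf56*f%%f"))
print(tokenize("56fdt"))
```

Under these assumptions, dtf56*f%%f yields ID dtf56, OP *, ID f, OP %, OP %, ID f, and 56fdt yields NUM 56 followed by ID fdt. Whether NUM immediately followed by ID is acceptable is then a question for the grammar, not for the scanner.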