Search code examples
compiler-errorscompiler-constructionlexical

Lexical Errors: misspelling of identifiers


In the dragon book it is mentioned that "Lexical errors include misspelling of identifiers. e.g. misspelling ellispeSize as elipseSize"

I cannot understand how is it true (that lexical analyzer can indentify misspelled identifiers and cause error). Both ellipseSize and elipseSize are valid lexemes for being identifier. So lexical analysis phase should have no issues with that. Possibly in later phases down the compilation we shall get to know that elipseSize has not been defined ...


Solution

  • "Lexical error" and "error detected during lexical analysis" are not the same thing.

    Actually, neither of those categories have any formal foundation. Formal language theory doesn't provide any definitive boundary between lexical and syntactic analysis, or even a definitive boundary between syntactic and semantic analysis. There isn't even an absolute requirement that lexical and syntactic analysis be separated; some practitioners seem to prefer so-called "scannerless parsing" where a single grammar is used to directly convert a stream of characters into a parse tree.

    In the passage you cite, I believe the authors are trying to produce an intuitive rather than formal classification of error types, although I'm not sure that the list serves any particular didactic purpose. In an earlier passage in the textbook (at the beginning of Chapter 3), they note the problem with detecting misspellings during a lexical scan:

    It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For example, if the string fi is encounterd… a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.

    Intuitively, misspelling a keyword or identifier is lexical in the sense that the error occurred while the programmer was typing a single lexeme. But most misspellings cannot be detected during lexical analysis. In a typical compiler, many won't be detected until semantic analysis reveals that the misspelling is not a declared variable. And if it happens to be a different variable declared elsewhere, the error might not be detected until it emerges as an execution-time bug at some later point in time.

    It seems unuseful that the same typing error might be classified in four different ways, which is why it's not unreasonable to classify it as a lexical error. Most of us would, in fact, say, "that's a typo", regardless of the stage of compilation/execution at which it was detected.

    As a final note, I know that many people would say that an undeclared variable is a "semantic" error, not a "syntactic" error (and that's another question you might find on an exam.) But that distinction is equally arbitrary; it is certainly possible to write a grammar which requires variables to be declared in the scope which contains their use. Such a grammar won't be context-free, but it is still a grammar. In fact, we commonly refer to "lexical scope"; not even "syntactic scope". That reflects the fact that for most languages, you can mechanically trace an identifier's use to a particular declaration without knowing anything about the semantics of the program. And it is common for identifiers to be correlated with their scope during AST construction. (In C++, the name resolution algorithm is incredibly intricate, but in many other languages it is quite simple.)

    Following paragraph contains only unsubstantiated opinions.

    I know that there are teaching environments in which the students might confront an exam question based on this textbook in which they are asked to classify different concrete programming errors according to the offhand categories shown in the section of the Dragon Book you cite. I firmly believe that this is (yet another) illustration of how badly computer science students are served by (certain) academic environments. It's possible that my belief that parsing theory is particularly plagued by bad teaching methods is simply a reflection of my greater familiarity with parsing theory than other aspects of computer science. But in general, rote learning is a terrible pedagogical method for a subject in which students need to learn how to think, and not just to regurgitate arbitrary lists. Computer science is far from the only such subject; there are very few subjects which could usefully be taught at the university level which don't fundamentally require analytic skills.