Say I have a C source file with the following content:
int i = 21 + 10;

int blah() {
    int i = 21;
    return i + 10;
}

int main() {
    int i = 0;
    i += i + 10;
}
At the end of the lexical analysis phase, what will be the contents of the symbol table? Will there be multiple entries for i and 10, or will the lexer only record unique values?
As I understand it, the token stream will contain all tokens as they appear, but I'm not sure about the symbol table.
Lexical analysis merely breaks the source code's stream of characters into a stream of tokens that represent the atoms (words) of the language.
In general, a lexer does not build a symbol table for a language.
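To make that concrete, here is a minimal sketch in my own illustrative C (the Token type and TK_* names are inventions for this answer, not any real lexer's API) of what a lexer hands onward for the first line of your file. Every occurrence of i and 10 appears in the stream; nothing is deduplicated and no symbol table is involved.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical token representation; real lexers differ in detail. */
enum TokenKind { TK_KEYWORD, TK_IDENT, TK_NUMBER, TK_PUNCT };

struct Token {
    enum TokenKind kind;
    const char *text;   /* the lexeme exactly as it appeared in the source */
};

int main(void) {
    /* Token stream for "int i = 21 + 10;" -- one Token per occurrence. */
    struct Token stream[] = {
        { TK_KEYWORD, "int" }, { TK_IDENT, "i" }, { TK_PUNCT, "=" },
        { TK_NUMBER, "21" },   { TK_PUNCT, "+" }, { TK_NUMBER, "10" },
        { TK_PUNCT, ";" },
    };
    for (size_t n = 0; n < sizeof stream / sizeof stream[0]; n++)
        printf("%u %s\n", (unsigned)stream[n].kind, stream[n].text);
    return 0;
}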
C is a rather special case because of the preprocessor. The C preprocessor must collect the names of macros, and their corresponding bodies, as preprocessor directives are read. Usually (this may vary) a C preprocessor (at least ours does; see my bio) operates by breaking the source into a stream of tokens and processing those tokens that occur in lines starting with "#" (i.e., preprocessor lines). Since the source stream depends on expanded macros, and expanding a macro requires that the preprocessor know its expansion, the preprocessor records the macro names and the body associated with each name, in a single global scope for macro definitions, as it encounters such definitions. It uses those recorded definitions to expand macro invocations and to evaluate preprocessor conditionals as it lexes. But the preprocessor does not build any other symbol tables.
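As a rough sketch of that single table (the structures and names below are my own, purely for illustration; a real preprocessor also records parameter lists, handles redefinition, and so on), the macro-definition scope can be as simple as a global list mapping macro names to replacement text:

#include <stdio.h>
#include <string.h>

/* Hypothetical macro table: one global scope, name -> replacement text. */
struct Macro {
    const char *name;
    const char *body;
};

static struct Macro macros[64];
static int macro_count = 0;

/* Called when a "#define NAME BODY" line is seen. */
static void define_macro(const char *name, const char *body) {
    macros[macro_count].name = name;
    macros[macro_count].body = body;
    macro_count++;
}

/* Called for each identifier token; NULL means "not a macro, pass through". */
static const char *lookup_macro(const char *name) {
    for (int n = 0; n < macro_count; n++)
        if (strcmp(macros[n].name, name) == 0)
            return macros[n].body;
    return NULL;
}

int main(void) {
    define_macro("BUFSIZE", "1024");          /* as if from "#define BUFSIZE 1024" */
    printf("%s\n", lookup_macro("BUFSIZE"));  /* prints 1024 */
    return 0;
}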
Ignoring the preprocessor, symbol tables are normally built after parsing the source text, certainly if the compiler is going to do any kind of global analysis. This is easiest if the compiler produces an abstract syntax tree (AST). Some compilers, especially those that generate code on the fly (rare these days), may instead build a symbol table as they parse, adding entries to the current scope as declarations are encountered.
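A common shape for such a symbol table, whether built from the AST or during parsing, is a chain of scopes, each holding its own entries and pointing at its enclosing scope. Here is a minimal sketch; all the names are my own, not any particular compiler's:

#include <stdlib.h>
#include <string.h>

/* Hypothetical scoped symbol table. A compiler walking the AST pushes a
   scope on entering a function or block and pops it on leaving. */
struct Entry {
    const char *name;
    struct Entry *next;
};

struct Scope {
    struct Scope *parent;
    struct Entry *entries;
};

static struct Scope *push_scope(struct Scope *parent) {
    struct Scope *s = calloc(1, sizeof *s);
    s->parent = parent;
    return s;
}

static void declare(struct Scope *s, const char *name) {
    struct Entry *e = calloc(1, sizeof *e);
    e->name = name;
    e->next = s->entries;
    s->entries = e;
}

/* Lookup walks outward, so an inner i shadows an outer i. */
static struct Entry *lookup(struct Scope *s, const char *name) {
    for (; s != NULL; s = s->parent)
        for (struct Entry *e = s->entries; e != NULL; e = e->next)
            if (strcmp(e->name, name) == 0)
                return e;
    return NULL;
}

int main(void) {
    struct Scope *global = push_scope(NULL);
    declare(global, "i");                     /* the file-scope i */
    struct Scope *blah = push_scope(global);
    declare(blah, "i");                       /* the i local to blah() */
    return lookup(blah, "i") != NULL ? 0 : 1; /* finds the inner i */
}

Applied to your file, that structure ends up with three distinct entries named i: one in the global scope and one each in the scopes for blah() and main().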
C compilers again present a common special case, due to the widespread use of weak parsing technology. Such technology cannot distinguish certain obscure syntax (notably "X* Y;") by pure parsing. To distinguish the cases, one needs to know whether certain symbols name types or not; compilers built on weak parsing technology therefore tangle the collection of at least type names in symbol scopes into the parsing process itself. This makes the code for such weak parsers very hard to understand. The links provided make it clear that this tangling is not actually necessary, and that one can cleanly isolate parsing from symbol table construction in a nicely engineered compiler.
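Here is the classic instance of that ambiguity: the identical token sequence X * Y ; is a declaration in one context and a multiplication in another, and only knowledge of what X names at that point distinguishes them:

/* The same tokens "X * Y ;" mean two different things depending on
   whether X currently names a type -- which pure context-free parsing
   cannot determine. */

typedef int X;

void as_declaration(void) {
    X* Y;              /* X names a type here: declares Y as pointer-to-X */
}

void as_expression(void) {
    int X = 6, Y = 7;  /* the local X shadows the file-scope typedef */
    X* Y;              /* X names a variable here: multiplies X by Y and
                          discards the result */
}

int main(void) {
    as_declaration();
    as_expression();
    return 0;
}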
With a sufficiently complex language, such as C++, one cannot build the symbol tables without parsing the entire source text. The notion of "namespace" means that entries for a namespace may keep arriving until the very last declaration in a source file: a file may open namespace N, close it, and reopen namespace N near the end to add one more member. In that case, the contents of the namespace (clearly a symbol scope) cannot be completed until that last declaration is processed.
If this all seems messy, it is. You are asking a general question which has to address the problems of hundreds of programming languages, many with strange rules about when symbols are defined and how they must be used to interpret the rest of the program source text.
Final summary: at the end of the lexical analysis phase there is no symbol table for your identifiers at all; for your example, only the preprocessor's macro table exists, and it is empty. The token stream contains every occurrence of i and 10 exactly as written. The symbol table, with its scopes and its multiple distinct entries for i, is built later, during or after parsing.