I am developing a "toy" compiler with flex and bisonc++ in ubuntu, which compile a C-like input language into optimized C++. In the input language contains a main functions (must have) and can have optionals functions outside the main, for example:
int myFunc1()
{
// some functionality
}
int main()
{
}
void myFunf2()
{
// some functionality
}
At the semantic analysis part I got trouble, I do not know how handle the scopes, I do not now know, what kind of symbol table structure should I provide or what kind of data structure should i create for this problem. There is a complex case for my problem (for the semantic analysis part):
int Name() // here "Name" is a function name
{
int Name = 0; // Here Name is a variable name
return Name // the function returns with the variable
}
int main()
{
int Name = 5; // Here name is a variable name, it is not the same variable which is in the Name()
function, it is a **different** **variable**
if ()
{
// int Name = 6; <- Name variable cannot be redeclared in this inner if block
// A variable can be declared once in a function it cannot be redeclared in an if, while, etc. block
}
}
void Name2()
{
int Name = 55; // here Name is also a different variable from the others
}
The simplest implementation of a scoped symbol table is a searchable stack capable of storing both symbols and scope markers. The operations are simple:
That datastructure could also be used if you later decide to allow locally shadowed declarations, as C does.
What you store in the declaration objects on the stack is highly dependent on the nature of your compiler. In the simplest case, local variables need only to be associated with a type and an offset in the current stack frame. Since a scope also effectively delimits variable lifetime, you can use the scope marker to keep the current offset in the stack frame, allowing you to restore the offset when you exit the scope.
Global variables would need a different treatment, in part because function names can be stored in the global symbol table and in part because allocation of the storage for global variables follows a different discipline. So it may be convenient to keep the globals in a different container, probably using something like a hash table, and search this container if the name is not found in the symbol stack.
Parenthetically, I know that suggesting a linear search of names always raises eyebrows. Indeed, it is not optimal in terms of execution time. However, at this stage in your project, it's more important to focus on getting things working; you can always add a more efficient lookup procedure later. (Make sure your symbol table provides a simple and well-document programming interface. That will make it much easier to later modify the implementation.) Anyway, it's really not a big issue. Experience shows that the vast majority of visible names at any point are global (corresponding to library functions, many of which are never actually used by a given source text). Local names are typically used close to their declaration. So the time spent in a linear lookup is not going to be a deal-breaker. If it worries you, profile your compiler to see if it shows up as significant before spending a lot of effort on optimising.
An additional complication arises if it is possible to declare a global variable within a scope (as it is in C, using the extern
keyword). (This might not be applicable to your language; it sounds like you are not contemplating programs spread over multiple source files. But you might some day want to add that.) To implement linkage correctly, you need to be clear on the difference between the scoped name and the externally linked object's name. These are the same name, but they have different rules: the locally scoped name is only visible in the scope in which it is declared, while the externally linked object's name will be visible to the linker so it needs to be added to yet another symbol repository which is not searched by name lookup. Furthermore, the same external object can be declared in multiple scopes.
All that assumes that you do not have nested function declarations in your language. C doesn't have these, so that seemed like a safe assumption. (And translating closures into C++ would be an additional challenge, which I think is outside of the scope of this post.)