Implementing the symbol table for a domain specific language using Python

Suppose I have a typed expression in some domain specific language:

x+y<=z

where x,y,z are typed as int

Which built-in data structure in python should we use to implement the symbol table? I only know dictionary so the symbol table can be implemented as

{x:'int', y:'int', z:'int'},

but maybe there is other better alternatives?

Solution

The basic concept of of a symbol table is mapping of identifiers-in-a-scope to information about the identifier (type, uses, ...)

So any mechanism that associates a name (almost always a string) with a "type value" is just fine as a foundation. So, dictionaries will work. (In fact, hash tables over identifier keys are the classic way to implement this).

But you need more that that for real symbol tables. You need to associate each such map with the scope in which it is valid. In many classic Algol like languages, such scopes are introduced by nested blocks. In more complex languages (e.g., C++) one has namespaces and other complex scoping structures and the relationship of the maps to the scopes may require a complex mapping back to the souce code (or AST nodes or whatever you are using as a representation).

Lookup in the "symbol table" requires rules about how one determines the current scope (thus the current identifier-to-type map), and what to do if the identifier is found in that scope, and what to do when it is NOT found in that scope (usually, look in another scope defined by language rules). Complex languages that allow overloading may require multiple entries in a scope to represent overloaded names; suddenly a simple dictionary isn't enough, you might need a tree of choices attached to each identifier found in the map or a more complex mapping of the identifier with signature data to a scope entry.

In many Algol like languages, "look in another scope" requires going up the "lexical nesting" of blocks, so each map must have an association with a parent scope. Complex languages such as C++ may have multiple inheritance rules; now you have to be able to determine which ("parent") scopes may contribute to the inheritance, and what order to search the parents. Because complex languages may have many different lookup rules depending on the context of the symbol, each identifier map may need its specific policy (procedural attachment) about how it does local lookups (e.g, handles found overloads) and how it handles failed lookups.

So while a dictionary is enough for a really simple language with only one scope, in practice you need a lot more "structure" to store a symbol table for a complex language.

And if you believe that your "simple" language will only have small instances, and thus need just a single scope, you will be rudely surprised by what your users end up doing. (Ever seen a thousand line SQL statement?) As the DSL instances get big, you need more scoping rules to make them manageable, and you'll end up will some or all of the complications I describe above. Think long term as you do this.

(Check my bio for a tool for building DSLs, that has symbol table machinery that handles all of the above. Not implemented in Python, though).