While building some sort of interpreter for a simple programming language, i stumbled upon an interesting problem. I call it the "Symbol Lookahead" problem.
What do i mean with this? For example, in C/C++ compilers, symbols that you are about to use, must always already be declared somewhere above in the code. Like this:
struct int_pair;
struct rectangle {
int_pair position;
int_pair size;
};
struct int_pair {
int x, y;
};
and not like this:
struct rectangle {
int_pair position;
int_pair size;
};
struct int_pair {
int x, y;
};
While in C# or Java, its okay to use any symbol everywhere you like in a file:
public class Rectangle {
private IntPair position, size; // using IntPair before declaring it
}
public class IntPair {
public int Sum() { // using x and y before declaring it
return x + y;
}
public int x, y;
}
So what is my problem?
I figured out, that while building the AST, it is useful to know if a Symbol in my code is referring to a function or a variable or something, so the interpreter can decide, what the following code should look like, which improves alotta things. This requires that this symbol is already declared somewhere in my program, but i really dont like this "C"-style programming. What i want is a programming style like in Java or C#: declarations can be anywhere in the code, you are not bound to a specific order - and the interpreter can depend on the symbol-meaning while building the AST.
One way of solving this may be building some sort of lookahead-algorithm that flies over the program and catches all relevant symbols, and nothing else. But to me, this seems quite messy and workaround-like.
So my problem is, that is want the c# and Java way of declaring stuff (see code snippets above), but i figure out how to implement this feature in an interpreter myself. So im asking for an idea/code snippet/algorithm/strategy that lets me lookahead symbols and their meaning.
help is very much appreciated. :)
The syntaxes of Java and C# are designed in such a way that expressions, such as function calls, are unambiguous even if you don't know the types of the subexpressions (including identifiers). C is not designed that way.
One of the advantages of unambiguous syntaxes is that you can create an AST without doing any type analysis, which means there is no need for predeclarations. Once you have the AST, you can do whatever analyses you wish, such as checking function prototypes against call sites, by walking the parse tree.
And there is no need to do all the analyses simultaneously; you can walk the tree as many times as necessary. This makes it much easier to organise your code, and doesn't impose any major overhead.