Search code examples
methodscompiler-constructionlookupabstract-syntax-treeinterpreter

Compiler/Interpreter design: should built in methods have their own Node or should a lookup table be used?


I'm designing a interpreter which uses recursive descent and I've got to the point where I'm starting to implement built in methods.

One example of a method I'm implementing is the print() method which outputs to the console, just like python's print() method and Java's System.out.println().

However it has come to my attention that there are multiple ways of implementing these built in methods. I'm sure there is many more but I have determined 2 viable ways of achieving this and I'm trying to establish which way is the best practise. For context below is the different layers I'm using within my interpreter which is loosely based off of https://www.geeksforgeeks.org/introduction-of-compiler-design/ and other tutorials I've come across.

  1. Lexer
  2. Parser
  3. Semantic Analyser
  4. Interpreter / Code Generator

1. Creating a AST Node for each individual built in method.

This method entails programming the parser to generate a node for each individual method. This means that a unique node will exist for each method. For example:

The parser will look to generate a node when a TPRINT token is found in the lexer.

print : TPRINT TLPAREN expr TRPAREN {$$ = new Print($3);}
      ;

And this is what the print class looks like.

class Print : public Node {
public:
    virtual VariableValue visit_Semantic(SemanticAnalyzer* analyzer) override;
    virtual VariableValue visit_Interpreter(Interpreter* interpreter) override;
    Node* value;
    Print(Node* cvalue) {
        value = cvalue;
    }
}

From there I define the visit_Semantic and visit_interpreter methods, and visit them using recursion from the top node.

I can think of a couple advantages/disadvantages to using this method:

Advantages

  • When the code generator walks the tree and visits the Print node's visit_interpreter method, it can straight away execute the response directly as it is programmed into it's visit methods.

Disadvantages

  • I will have to write a lot of copy paste code. I will have to create a node for each separate method and define it's parser grammar.

2. Creating a Generic AST Node for a method call Node and then use a lookup table to determine which method is being called.

This involves creating one generic node MethodCall and grammar to determine if a method has been called, with some unique identifier such as a string for the method it is referring to. Then, when the MethodCall's visit_Interpreter or visit_Semantic method is called, it looks up in a table which code to execute.

methcall : TIDENTIFIER TLPAREN call_params TRPAREN {$$ = new MethodCall($1->c_str(), $3);}
         ;

MethodCall Node. Here the unique identifier is std::string methodName:

class MethodCall : public Node {
public:
    virtual VariableValue visit_Semantic(SemanticAnalyzer* analyzer) override;
    virtual VariableValue visit_Interpreter(Interpreter* interpreter) override;
    std::string methodName;
    ExprList *params;
    MethodCall(std::string cmethodName, ExprList *cparams) {
        params = cparams;
        methodName = cmethodName;
    }
};

Advantages:

  • One generic grammar/node for all method calls. This makes it more readable

Disadvantages:

  • At some point the unique identifier std::string methodName will have to be compared in a lookup table to determine a response for it. This is not as efficient as directly programming in a response to a Node's visit methods.

Which practise is the best way of handling methods within a compiler/interpreter? Are there different practises all together which are better, or are there any other disadvantages/advantages I am missing?

I'm fairly new to compiler/interpreter design so please ask me to clarify if I have got some terminology wrong.


Solution

  • As far as I see it, you have to split things up into methods somewhere. The question is, do you want to implement this as part of the parser definition (solution 1) or do you want to implement this on the C++ side (solution 2).

    Personally, I would prefer to keep the parser definition simple and move this logic to the C++ side, that is solution 2.

    From a runtime perspective of solution 2, I wouldn't worry too much about that. But in the end, it depends on how frequently that method is called and how many identifiers you have. Having just a few identifiers is different from comparing say hundreds of strings in an "else if" manner.

    You could first implement it the simple straight-forward way, i.e. match the string identifiers in an "else if" manner, and see if you run into runtime issues.

    If you experience runtime issues, you could use a hash function. The "hardcore" way would be to implement an optimal hash function yourself and check hash function optimality offline, since you know the set of string identifiers. But for your application that would be probably be rather overkill or for academic purposes, and I would recommend to just use unordered_map from the STL (which uses hashing under the hood, see also How std::unordered_map is implemented) to map your string identifiers to index numbers, so that you can implement your jump table with an efficient switch operation on those index numbers.