c++parsing polymorphism abstract-syntax-tree recursive-descent

Polymorphic Abstract Syntax Tree (recursive descent parser): impossible?

I have begun writing a polymorphic recursive descent parser in C++. However I am running an issue. The classes are set up like this:

class Node {
public:
    std::vector<Node*> children;
};

class NodeBinary : public Node {
public:
    Node* left;
    Node* right;
};

class NodeUnary : public Node {
public:
    Node* operand;
};

class NodeVar : public Node {
public:
    std::string string;
    NodeVar(std::string str) : string(str) {};
};

class NodeNumber : public Node {
public:
    signed long number;
    NodeNumber(signed long n) : number(n) {};
};

// etc.

And then classes like NodeDeclaration, NodeCall, NodeNot, NodeAssignment, NodePlus, NodeMinus, NodeIf etc. will inherit either from Node or something less generic like NodeBinary or NodeUnary.

However, some of them take more specific operands. NodeAssignment always takes a var and a number/expression. So I will have to override Node* left to NodeVar* left and NodeExpr* right. The problem comes in with things like NodePlus. Left can be a NodeVar or a NodeExpr! And the root node has a similar problem: while parsing at the top level to add children nodes to root, how is it possible to tell if a child is a NodeExpr, a NodePlus, a NodeIf etc...?

I could have all Nodes have a enum "type" that says what type it is, but then whats the point of having a nice polymorphic inheritance tree?

How is is this problem normally solved??

Solution

If you're using class inheritance for your AST nodes, you need to create an appropriate inheritance hierarchy, as with any object-oriented design.

So, for example, NodeAssignment (which is presumably a specialization of NodeStatement) needs to contain a NodeLValue (of which a NodeVariable is a specialization) and a NodeValue. As usual, LValues (i.e. things you can assign to) are a subset of Values, so NodeLValue will be a specialization of NodeValue. And so on. Your binary operator node will contain left and right members, both of which are NodeValue base objects (I would expect NodeValue to be pure virtual, with a large number of specific specializations.)

If you insist on using a recursive descent parser, each parsing function needs to return an appropriate subclass of Node, so that the function which parses the left-hand side of an assignment would logically return a NodeLValue*, ready to insert into the NodeAssignment constructor. (Frankly, I'd ditch the word Node in all of those class names. Put them all into the namespace node:: and save yourself some typing.)