Search code examples
javascriptalgorithmtreeautomata

How to implement a basic iterative pushdown automaton parsing algorithm with literal states and transitions in JavaScript?


We desire to create a pushdown automaton (PDA) that uses the following "alphabet" (by alphabet I mean a unique set of symbol strings/keys):

+aaa
+bbb
+ccc
+ddd
+eee
-aaa
-bbb
-ccc
-ddd
-eee

The "symbols" (3-letter sequences with a + or - prefix) in this alphabet are used to make trees. Here are some example trees:

+eee-eee
+aaa+bbb-bbb-aaa
+bbb+aaa+ccc+eee-eee-ccc+ccc-ccc+ddd+ccc+eee-eee-ccc-ddd-aaa-bbb

Visualized as an actual tree, they would be more like:

+eee
-eee

+aaa
  +bbb
  -bbb
-aaa

+bbb
  +aaa
    +ccc
      +eee
      -eee
    -ccc
    +ccc
    -ccc
    +ddd
      +ccc
        +eee
        -eee
      -ccc
    -ddd
  -aaa
-bbb

So given this alphabet and these example trees, the question is how you would write a generic pushdown automaton to parse these trees. The rules are:

  1. Any letter pair (open/close pair) can have any number of nested children, and it doesn't matter what letter pairs are nested as children.

How would you write a pushdown automaton in JavaScript to parse a string into an AST?

By this I mean, the implementation must literally have a stack, states, and transitions. By this I mean, not implementing an ad-hoc parser, or even a recursive descent parser. This should be an iterative while loop looping through transitions somehow, not using recursion.

The output should be a very simple "AST" that just looks like this (for the +aaa+bbb-bbb-aaa tree):

{
  "type": "aaa",
  "children": [
    {
      "type": "bbb",
      "children": []
    }
  ]
}

I am wondering how to build a PDA so I can, in my specific case working on a custom programming language, convert a rather involved and complex AST I am working and parse it into an object graph. That question is too complicated to write out on SO, and too hard to simplify, which is why I am asking here about a very basic PDA implementation in JavaScript.

I am interested to see how you keep track of the context at each branch point when doing the PDA algorithm and what the transitions look like.

Note: I've heard of / seen "2-state" PDAs occasionally mentioned here and there. It sounds like you have one state to do the computation, and one "final" state. Maybe even just a single state. If it's possible to do it sort of like this, then that would be cool. If not, don't worry about it.

If necessary, you can first write a Context Free Grammar and use that to dynamically construct a PDA, but the key question here is how the PDA is technically implemented. I think it's probably possible / straightforward to just write each transition function by hand, but I'm not entirely sure. Either way works for me.


Solution

  • It is not possible to create a PDA for producing this tree data structure because:

    • a PDA has a finite number of states and stack symbols, and no output tape. Yet the number of trees that this PDA should be able to somehow represent (and make available) is infinite. If we consider the nodes of a tree in an object-oriented structure, then there are an infinite number of different nodes possible, as a node's identity is determined also by the references it has. This is in conflict with the finite set of symbols that a PDA has. If as alternative we would not go for an object-oriented output, then we gain nothing: the input already is a serialized representation of the tree.
    • Even if you would add an output tape to this automaton, making it a rewrite system, you wouldn't gain much, since the input is already representing the tree in a serialized format. If I understood correctly, you are not interested in serialized output (like JSON, which would be trivial as the output order is the same as the input order), but structured output (JavaScript-like objects).

    So, as others have also noted, we need to loosen the rules somehow. We can think of several (combinations of) ways to do that:

    1. Allow for an infinite number of states, or
    2. Allow for an infinite stack alphabet, or
    3. Allow the PDA to have access to more of the stack than just its top element
    4. Let the transition table refer to a particular attribute of the stack's top element, not the element as a whole -- allowing infinite possibilities for other attributes of the stacked elements, while ensuring that the values of this particular attribute belong to a finite set.
    5. Keep the tree building outside of the PDA
    6. ...Any other measures for solving the incompatibility of PDA with the requirement to produce a structured tree.

    Some of these options would infer an infinite transition table, and so in that case it cannot be a table, but could be a function.

    Implementation choices

    1. For the generic engine

    Considering your focus on tree-building, and the requirement to have "a stack, states, and transitions", I went for the following implementation choices, and sometimes simplifications:

    • I take strategy 4 from the above list
    • The generic engine will itself create the tree. So it is only generic in a limited sense. It can only validate and generate trees.
    • It will (only) take a transition table as configuration input
    • It has only two states (boolean), which indicate whether all is OK so far, or not.
    • The OK-state can only go from true to false when there is no matching transition in the transition table
    • Once the OK-state is false, the engine will not even attempt to find a transition, but will just ignore any further input.
    • By consequence the transition table will not include the current state and future state parts, as they are both implied to be true (i.e. "OK").
    • The tree will be encoded in the stack. The data property of a stack element will be the special attribute that will serve for identifying transitions. The children attribute of each stack element will serve to define the tree and is not considered part of the stack alphabet that the transition table targets.
    • The stack starts out with one element. Its children attribute will be the output of the engine. At every step this output can be consulted, since this element is never removed from the stack.
    • The input symbols cannot include the empty string, which is reserved for signalling the end of the input stream. This allows the engine to give feedback whether that is a valid moment to end the input stream.
    • The engine requires that its stack is "empty" (has just 1 entry) when the end of the input is indicated.
    • I assumed that input symbols do not contain spaces: the space is used as a delimiter in the transition lookup algorithm
    • The given transition table is converted to a (hash)map to allow for fast lookup given the current state and the data at the top of the stack. The keys in this map are concatenations of input and stack data values, and it is here that the space was chosen as delimiter. If a space is a valid character in an input symbol, then another encoding mechanism should be selected for serializing this key-pair (e.g. JSON), but I wanted to keep this simple.
    • The engine does not need to get the set of input symbols, nor the set of stack symbols as configuration input, as these will be implied from the transition table.
    • The engine's initial state is set to OK (it could be set differently by simple property assignment, but that would be useless as it would then ignore input)

    2. For the specific +/- input format

    The more specific part of the solution deals with the actual input format you have given in the question. One function will covert the set of types to a transition table, and another will tokenize the input based on the "+" and "-" characters, but without any validation (no symbol check, nor symbol length check, ...), as this will all surface as error anyway when the engine is called.

    Any white space in the input is ignored.

    Implementation

    // To peek the top element of a stack:
    Array.prototype.peek = function () { return this[this.length - 1] };
    
    function createParser(transitions) {
        function createNode(data) {
            return { 
                data, 
                children: [] 
            };
        }
    
        function addChild(parentNode, data) {
            const childNode = createNode(data);
            parentNode.children.push(childNode);
            return childNode;
        }
    
        let isOK = true; // It's a 2-state (boolean) engine. Transitions implicitly only apply to OK-state
        const stack = [createNode("")]; // Stack is private, and always has Sentinel node with value ""
        // Create a map for the transitions table, for faster lookup
        // We use space as a delimiter for the key pair, assuming an input symbol does not include spaces.
        const transMap = new Map(transitions.map(({whenInput, whenStack, thenPushValue}) =>
            [whenInput + " " + whenStack, thenPushValue]
        ));
        const parser = {
            read(input) { // Returns true when parser can accept more input after this one
                // Once the engine is in an error state, it will not process any further inputs
                if (!isOK) {
                    return false;
                }
                // Consider the empty string as the end-of-input marker
                if (input === "") { 
                    // Even when state is OK, the stack should now also be "empty"
                    isOK &&= stack.length === 1;
                    return false; // Not an error condition, but indication that no more input is expected
                }
                // Transitions do not include state requirements, nor new state definitions.
                // It is assumed that a transition can only occur in an OK state, and that all 
                //    included transitions lead to an OK state.
                const pushValue = transMap.get(input + " " + stack.peek().data);
                if (pushValue === undefined) { // No matching transition in the table implies that state is not OK
                    isOK = false;
                } else {
                    // As this is a two-state engine, with defined transitions only between OK states,
                    // each defined transition will affect the stack: so it's either a push or pop.
                    // A push is identified by the (non-empy) value to be pushed. An empty string denotes a pop.
                    if (pushValue) {
                        stack.push(addChild(stack.peek(), pushValue));
                    } else {
                        stack.pop();
                    }
                }
                
                return isOK;
            },
            isOK, // Expose the (boolean) state
            output: stack[0].children // Don't expose the stack, but just the forest encoded in it
        };
        return parser;
    }
    
    function createTransition(whenInput, whenStack, thenPushValue) {
        return {whenInput, whenStack, thenPushValue}; // First two values imply the third
    }
    
    // Functions specific for the input format in the question:
    
    function createTransitionTable(types) {
        // Specific to the input structure (with + and - tags) given in the question
        // An empty string in the second column represents an empty stack
        return [
            // Transitions for opening tags: O(n²)
            ...["", ...types].flatMap(stack => 
                types.map(type => createTransition("+" + type, stack, type))
            ),
            // Transitions for closing tags
            ...types.map(type => createTransition("-" + type, type, ""))
        ];
    }
    
    function tokenizer(str) { // Could be a generator, but I went for a function-returning function
        str = str.replace(/\s+/g, ""); // remove white space from input string
    
        let current = 0; // Maintain state between `getNextToken` function calls
        
        function getNextToken() {
            const start = current++;
            while (current < str.length && str[current] !== "+" && str[current] !== "-") {
                current++;
            }
            const token = str.slice(start, current);
            
            console.log("read", token); // For debugging
            return token;
        }
        return getNextToken;
    }
    
    // Example input language, as defined in the question:
    const types = ["aaa", "bbb", "ccc", "ddd", "eee"];
    const transitionTable = createTransitionTable(types);
    const parser = createParser(transitionTable);
    
    // Example input for it:
    const rawInput = `
    +eee-eee
    +aaa+bbb-bbb-aaa
    +bbb+aaa+ccc+eee-eee-ccc+ccc-ccc+ddd+ccc+eee-eee-ccc-ddd-aaa-bbb`;
    const getNextToken = tokenizer(rawInput);
    
    // Process tokens
    while (parser.read(getNextToken())) {}
    
    console.log("Parser state is OK?: ", parser.isOK);
    console.log("Parser output:\n");
    console.log(JSON.stringify(parser.output, null, 3));