We desire to create a pushdown automaton (PDA) that uses the following "alphabet" (by alphabet I mean a unique set of symbol strings/keys):
+aaa
+bbb
+ccc
+ddd
+eee
-aaa
-bbb
-ccc
-ddd
-eee
The "symbols" (3-letter sequences with a + or - prefix) in this alphabet are used to make trees. Here are some example trees:
+eee-eee
+aaa+bbb-bbb-aaa
+bbb+aaa+ccc+eee-eee-ccc+ccc-ccc+ddd+ccc+eee-eee-ccc-ddd-aaa-bbb
Visualized as an actual tree, they would be more like:
+eee
-eee
+aaa
+bbb
-bbb
-aaa
+bbb
+aaa
+ccc
+eee
-eee
-ccc
+ccc
-ccc
+ddd
+ccc
+eee
-eee
-ccc
-ddd
-aaa
-bbb
So given this alphabet and these example trees, the question is how you would write a generic pushdown automaton to parse these trees. The rules are:
How would you write a pushdown automaton in JavaScript to parse a string into an AST?
By this I mean, the implementation must literally have a stack
, states
, and transitions
. By this I mean, not implementing an ad-hoc parser, or even a recursive descent parser. This should be an iterative while loop looping through transitions somehow, not using recursion.
The output should be a very simple "AST" that just looks like this (for the +aaa+bbb-bbb-aaa
tree):
{
"type": "aaa",
"children": [
{
"type": "bbb",
"children": []
}
]
}
I am wondering how to build a PDA so I can, in my specific case working on a custom programming language, convert a rather involved and complex AST I am working and parse it into an object graph. That question is too complicated to write out on SO, and too hard to simplify, which is why I am asking here about a very basic PDA implementation in JavaScript.
I am interested to see how you keep track of the context at each branch point when doing the PDA algorithm and what the transitions look like.
Note: I've heard of / seen "2-state" PDAs occasionally mentioned here and there. It sounds like you have one state to do the computation, and one "final" state. Maybe even just a single state. If it's possible to do it sort of like this, then that would be cool. If not, don't worry about it.
If necessary, you can first write a Context Free Grammar and use that to dynamically construct a PDA, but the key question here is how the PDA is technically implemented. I think it's probably possible / straightforward to just write each transition function by hand, but I'm not entirely sure. Either way works for me.
It is not possible to create a PDA for producing this tree data structure because:
So, as others have also noted, we need to loosen the rules somehow. We can think of several (combinations of) ways to do that:
Some of these options would infer an infinite transition table, and so in that case it cannot be a table, but could be a function.
Considering your focus on tree-building, and the requirement to have "a stack, states, and transitions", I went for the following implementation choices, and sometimes simplifications:
data
property of a stack element will be the special attribute that will serve for identifying transitions. The children
attribute of each stack element will serve to define the tree and is not considered part of the stack alphabet that the transition table targets.children
attribute will be the output of the engine. At every step this output can be consulted, since this element is never removed from the stack.The more specific part of the solution deals with the actual input format you have given in the question. One function will covert the set of types to a transition table, and another will tokenize the input based on the "+" and "-" characters, but without any validation (no symbol check, nor symbol length check, ...), as this will all surface as error anyway when the engine is called.
Any white space in the input is ignored.
// To peek the top element of a stack:
Array.prototype.peek = function () { return this[this.length - 1] };
function createParser(transitions) {
function createNode(data) {
return {
data,
children: []
};
}
function addChild(parentNode, data) {
const childNode = createNode(data);
parentNode.children.push(childNode);
return childNode;
}
let isOK = true; // It's a 2-state (boolean) engine. Transitions implicitly only apply to OK-state
const stack = [createNode("")]; // Stack is private, and always has Sentinel node with value ""
// Create a map for the transitions table, for faster lookup
// We use space as a delimiter for the key pair, assuming an input symbol does not include spaces.
const transMap = new Map(transitions.map(({whenInput, whenStack, thenPushValue}) =>
[whenInput + " " + whenStack, thenPushValue]
));
const parser = {
read(input) { // Returns true when parser can accept more input after this one
// Once the engine is in an error state, it will not process any further inputs
if (!isOK) {
return false;
}
// Consider the empty string as the end-of-input marker
if (input === "") {
// Even when state is OK, the stack should now also be "empty"
isOK &&= stack.length === 1;
return false; // Not an error condition, but indication that no more input is expected
}
// Transitions do not include state requirements, nor new state definitions.
// It is assumed that a transition can only occur in an OK state, and that all
// included transitions lead to an OK state.
const pushValue = transMap.get(input + " " + stack.peek().data);
if (pushValue === undefined) { // No matching transition in the table implies that state is not OK
isOK = false;
} else {
// As this is a two-state engine, with defined transitions only between OK states,
// each defined transition will affect the stack: so it's either a push or pop.
// A push is identified by the (non-empy) value to be pushed. An empty string denotes a pop.
if (pushValue) {
stack.push(addChild(stack.peek(), pushValue));
} else {
stack.pop();
}
}
return isOK;
},
isOK, // Expose the (boolean) state
output: stack[0].children // Don't expose the stack, but just the forest encoded in it
};
return parser;
}
function createTransition(whenInput, whenStack, thenPushValue) {
return {whenInput, whenStack, thenPushValue}; // First two values imply the third
}
// Functions specific for the input format in the question:
function createTransitionTable(types) {
// Specific to the input structure (with + and - tags) given in the question
// An empty string in the second column represents an empty stack
return [
// Transitions for opening tags: O(n²)
...["", ...types].flatMap(stack =>
types.map(type => createTransition("+" + type, stack, type))
),
// Transitions for closing tags
...types.map(type => createTransition("-" + type, type, ""))
];
}
function tokenizer(str) { // Could be a generator, but I went for a function-returning function
str = str.replace(/\s+/g, ""); // remove white space from input string
let current = 0; // Maintain state between `getNextToken` function calls
function getNextToken() {
const start = current++;
while (current < str.length && str[current] !== "+" && str[current] !== "-") {
current++;
}
const token = str.slice(start, current);
console.log("read", token); // For debugging
return token;
}
return getNextToken;
}
// Example input language, as defined in the question:
const types = ["aaa", "bbb", "ccc", "ddd", "eee"];
const transitionTable = createTransitionTable(types);
const parser = createParser(transitionTable);
// Example input for it:
const rawInput = `
+eee-eee
+aaa+bbb-bbb-aaa
+bbb+aaa+ccc+eee-eee-ccc+ccc-ccc+ddd+ccc+eee-eee-ccc-ddd-aaa-bbb`;
const getNextToken = tokenizer(rawInput);
// Process tokens
while (parser.read(getNextToken())) {}
console.log("Parser state is OK?: ", parser.isOK);
console.log("Parser output:\n");
console.log(JSON.stringify(parser.output, null, 3));