Search code examples
javaparsinglalr

Source code parsing and macro-like handling of similar constructions


TL;DR VERSION: Is there a parser generator that supports the following: when some rule is reduced (I assume LALR(1) parser), then reduction isn't performed, but parser backs off and replaces input with different code using values from this rule and parses that code. Repeat if needed. So if code is "i++" and rule is "expr POST_INCR", I can do more or less:

expr POST_INCR -> "(tmp = expr; expr = expr + 1; tmp)"

So basicaly code rewriting using macros?

LONG VERSION:

I wrote yet another simple interpreted language (in Java for simplicity). It works ok, but it raised some question. Introduction is pretty long, but simple and helps to shows my problem clearly (I think).

I have "while" loop. It is pretty simple, given:

WHILE LPARE boolean_expression RPAREN statement

I generate more or less the following:

new WhileNode(boolean_expression, statement); 

This creates new node that, when visited later, generates code for my virtual machine. But I also have the following:

FOR LPAREN for_init_expr SEMICOLON boolean_expression SEMICOLON for_post_expr RPAREN statement

This is "for loop" known from Java or C. From aforementioned rule I create more or less the following:

new ListNode(
    for_init_expr,
    new WhileNode(
        boolean_expression,
        new ListNode(
            statement,
            new ListNode(for_post_expr, null))))

This is of course simple transformation, from:

for (for_init ; boolean_expression ; for_post_expr)
    statement

to:

for_init
while (boolean_expression) {
    statement
    for_post_expr;
}

All is fine and dandy, but things get hairy for the following:

FOR LPAREN var_decl COLON expression RPAREN statement

This if well known and liked:

for (int x : new int[] { 1, 2 })
    print(x);

I refrain from posting code that generates AST, since basic for loop was already a little bit long, and what we get here is ever worse. This construction is equal to:

int[] tmp = new int[] { 1, 2 };
for (int it = 0 ; it < tmp.length; it = it + 1) {
    int x = tmp[it];
    print(x);
}

And since I'm not using types, I simply assume that "expression" (so right side, after COLON) is something that I can iterate over (and arrays are not iterable), I call a function on a result of this "expression" which returns instance of Iterable. So, in fact, my rewritten code isn't as simple as one above, is more or less this:

Iterator it = makeIterable(new int[] { 1, 2 });
while (it.hasNext()) {
    int x = it.next();
    print(x);
}

It doesn't look THAT bad, but note that AST for this generates three function calls and while loop. To show you what mess it is, I post what I have now:

113     | FOR LPAREN var_decl_name.v PIPE simple_value_field_or_call.o RPAREN statement.s
114                                         {: Symbol sv = ext(_symbol_v, _symbol_o);
115                                            String autoVarName = generateAutoVariableName();
116                                            Node iter = new StatementEndNode(sv, "",
117                                                new BinNode(sv, CMD.SET, "=",
118                                                    new VarDeclNode(sv, autoVarName),
119                                                    new CallNode(sv, "()",
120                                                        new BinNode(sv, CMD.DOT, ".",
121                                                            new MkIterNode(sv, o),
122                                                            new PushConstNode(sv, "iterator")))));
123                                            Node varinit = new StatementEndNode(sv, "",
124                                                new BinNode(sv, CMD.SET, "=",
125                                                    v,
126                                                    new PushConstNode(sv, "null")));
127                                            Node hasnext = new CallNode(sv, "()",
128                                                new BinNode(sv, CMD.DOT, ".",
129                                                    new VarNode(sv, autoVarName),
130                                                    new PushConstNode(sv, "hasNext")));
131                                            Node vargennext = new StatementEndNode(sv, "",
132                                                new BinNode(sv, CMD.SET, "=",
133                                                    new VarNode(sv, v.name),
134                                                    new CallNode(sv, "()",
135                                                        new BinNode(sv, CMD.DOT, ".",
136                                                            new VarNode(sv, autoVarName),
137                                                            new PushConstNode(sv, "next")))));
138                                            return new ListNode(sv, "",
139                                                new ListNode(sv, "",
140                                                    new ListNode(sv, "",
141                                                        iter
142                                                    ),
143                                                    varinit
144                                                ),
145                                                new WhileNode(sv, "while",
146                                                    hasnext,
147                                                    new ListNode(sv, "",
148                                                        new ListNode(sv, "",
149                                                            vargennext
150                                                        ),
151                                                        s)));

To answer your questions: yes, I am ashamed of this code.

QUESTION: Is there are parser generator that let's me do something about it, namely given rule:

FOR LPAREN var_decl COLON expr RPAREN statement

tell parser to rewrite it as if it was something else. I imagine that this would require some kind of LISP's macro mechanism (which is easy in lisp due to basically lack of grammar whatsoever), maybe similar to this:

FOR LPAREN var_decl COLON expr RPAREN statement =
    {   with [ x = generateAutoName(); ]
        emit [ "Iterator $x = makeIterable($expr).iterator();"
               "while (${x}.hasNext()) {"
               "$var_decl = ${x}.next();"
               "$statement"
               "}"
        ]
    }

I don't know if this is a well known problem or not, I simply don't even know what to look for - the most similar question that I found is this one: Any software for pattern-matching and -rewriting source code? but it isn't anywhere close to what I need, since it is supposed to work as a separate step and not during compilation, so it doesn't qualify.

Any help will be appreciated.


Solution

  • I think you are trying to bend the parser too much. You can simply build the tree with macro in it, and then post-process the tree to replace the macros with whatever substitution you want.

    You can do this by walking the resulting tree, detecting the macro nodes (or places where you want to do substitutions), and simply splicing in replacements with procedural tree hackery. Not pretty but workable. You should able to do this with the result of any parser generator/AST building machinery.

    If you want a more structured approach, you could build your AST and then use source-to-source transformations to "rewrite" the macros to their content. Out DMS Software Reengineering Toolkit can do this, you can read more details about what the transforms look like.

    Using the DMS approach, your concept:

    expr POST_INCR -> "(tmp = expr; expr = expr + 1; tmp)"
    

    requires that you parse the original text in the conventional way with the grammar rule:

     term = expr POST_INCR ;
    

    You would give all these grammar rules to DMS and let it parse the source and build your AST according to the grammar.

    Then you apply a DMS rewrite to the resulting tree:

    domain MyToyLanguage; -- tells DMS to use your grammar to process the rule below
    
    rule replace_POST_INCR(e: expr): term->term
      = "\e POST_INCR" -> " (tmp = \e; \e = \e + 1; tmp) ";
    

    The quote marks here are "domain meta quotes" rather than string literal quotes. The text outside the double-quotes is DMS rule-syntax. The text inside the quotes is syntax from your language (MyToyLangauge), and is parsed using the parser you provided, some special escapes for pattern variables like \e. (You don't have to do anything to your grammar to get this pattern-parsing capability; DMS takes care of that).

    By convention with DMS, we often name literal tokens like POST_INCR with a quoted equivalent '++' in the lexer, rather than using such a name. Instead of

    #token POST_INCR "\+\+"
    

    the lexer rule then looks like:

    #token '++'  "\+\+"
    

    If you do that, then your grammar rule reads like:

    term = expr '++' ;
    

    and your rewrite rule now looks like:

    rule replace_POST_INCR(e: expr): term->term
      = "\e++" -> " (tmp = \e; \e = \e + 1; tmp) ";
    

    Following this convention, the grammar (lexer and BNF) is (IMHO) a lot more readable, and the rewrite rules are more readable too, since they stay extremely close to the actual language syntax.