Search code examples
c++compiler-constructionantlrantlr3translate

How to implement a computer language translator using two separate grammars


I want to create a computer language translator between two languages LANG1 and LANG2. More specifically, I want to translate code written in LANG1 into source code in LANG2.

I have the BNF Grammar for both LANG1 and LANG2.

LANG1 is a small DSL I wrote by myself, and is essentially, an "easier" version of LANG2.

I want to be able to generate statements in LANG2, from input statements written in LANG1.

I am in the process of compiling a compiler for LANG1, but I don't know what to do next (in order to translate LANG1 statements to LANG2 statements).

My understanding of the steps involved are as follows:

1. BNF for my DSL (LANG1)                                  DONE
2. Reverse engineered the BNF for LANG2                    DONE
3. Learning how to generate a compiler for LANG1           TODO
4. Translate LANG1 statements to LANG2 statements          ???

What are the steps involved in order to generate LANG2 statements from LANG1 statements?

My code base is in C++, so I can use Parser generated in either C or C++.

PS: I will be using ANTLR3 to generate the compiler for LANG1


Solution

  • In the most general case you have to translate each possible section of the grammar from LANG1 into something appropriate for LANG2, or you have to dumb down into the simplest possible primitives of both languages, like assembly or combinators. This is a little time consuming and not very fun.

    However, if the grammars are equivalent or share a lot of common ground, you might just be able to get away with just parsing into the same tree for both grammars and having output functions that can take your standardized tree and convert it back into LANG1 or LANG2 source (which is mostly the same as the general case but takes a lot more short cuts).

    EDIT: Since I've just reread your question realized that you only want to translate one way, you need only worry about making the form of the tree suit LANG1 and just have your translate function for LANG2. But I hope my example is helpful anyway.

    Example Tree construction:

    Here are two different ANTLR grammars that produce the same extremely simple AST:

    Grammar 1

    The first one is the standard way of expressing addition:

    grammar simpleAdd;
    
    options {output=AST;}
    
    tokens {
        PLUS = '+';
    }
    
    expression : addition EOF!;
    addition   : NUMBER (PLUS NUMBER)+ -> ^(PLUS NUMBER*);
    
    NUMBER     :'0'..'9'+;
    WHITESPACE : ( '\t' | ' ' | '\r' | '\n')+    { $channel = HIDDEN; } ;
    

    This will accept two or more integers and produce a tree with a PLUS node and all the numbers in a list to be added together. E.g.,

    1 + 1 + 2 + 3 + 5
    

    Grammar 2

    The second grammar takes a less elegant form:

    grammar horribleAdd;
    
    options {output=AST;}
    
    tokens {
        PLUS = '+';
        PLUS_FUNC = 'plus';
        COMMA = ',';
        LEFT_PARENS ='(';
        RIGHT_PARENS=')';
    } 
    
    expression : addition EOF!;
    addition   : PLUS_FUNC LEFT_PARENS NUMBER (COMMA NUMBER)+ RIGHT_PARENS -> ^(PLUS NUMBER*);
    
    NUMBER     :'0'..'9'+;
    WHITESPACE : ( '\t' | ' ' | '\r' | '\n')+    { $channel = HIDDEN; } ;
    

    This grammar expects numbers given to a function (yes, I know functions don't really work like this, I'm just trying to keep the example as clear as possible). E.g.,

    plus(1, 1, 2, 3, 5)
    

    It produces exactly the same tree as the first grammar (a PLUS node with the numbers as children).

    Now you don't need to worry what language your instructions came from, you can output it in whatever form you like. All you need to do is write a function that can convert this AST back into a language of your choosing.