
Working example of wikitext-to-HTML in ANTLR 3


I'm trying to flesh out a wikitext-to-HTML translator in ANTLR 3, but I keep getting stuck.

Do you know of a working example that I can inspect? I tried the MediaWiki ANTLR grammar and the Wiki Creole grammar, but I can't get them to generate the lexer & parser in ANTLR 3.

Here are the links to two grammars I've tried using:

I can't get either of them to generate my Java lexer and parser (I'm using ANTLR 3 as an Eclipse plugin). The MediaWiki grammar takes a very long time to build and eventually throws an OutOfMemory exception. The other one has errors in it that I don't know how to debug.

EDIT: Okay I've got a very basic grammar:

grammar wikitext;

options {
  //output = AST;
  //ASTLabelType = CommonTree;
  output = template;
  language = Java;
}

document: line (NL line?)*;

line: horizontal_line | list | heading | paragraph;

/* horizontal line */
horizontal_line: HRLINE;

/* lists */
list: unordered_list | ordered_list;

unordered_list: '*'+ content;
ordered_list: '#'+ content;

/* Headings */
heading: heading1 | heading2 | heading3 | heading4 | heading5 | heading6;
heading1: H1 plain H1;
heading2: H2 plain H2;
heading3: H3 plain H3;
heading4: H4 plain H4;
heading5: H5 plain H5;
heading6: H6 plain H6;

/* Paragraph */
paragraph: content;

content: (formatted | link)+;

/* links */
link: external_link | internal_link;

external_link: '[' external_link_uri ('|' external_link_title)? ']';
internal_link: '[[' internal_link_ref ('|' internal_link_title)? ']]' ;

external_link_uri: CHARACTER+;
external_link_title: plain;
internal_link_ref: plain;
internal_link_title: plain;

/* bold & italic */
formatted: bold_italic | bold | italic | plain;

bold_italic: BOLD_ITALIC plain BOLD_ITALIC;
bold: BOLD plain BOLD;
italic: ITALIC plain ITALIC;

/* Plain text */
plain: (CHARACTER | SPACE)+;


/**
 * LEXER RULES
 * --------------------------------------------------------------------------
 */

HRLINE: '---' '-'+;

H1: '=';
H2: '==';
H3: '===';
H4: '====';
H5: '=====';
H6: '======';

BOLD_ITALIC: '\'\'\'\'\'';
BOLD: '\'\'\'';
ITALIC: '\'\'';

NL: '\r'?'\n';

CHARACTER       :       '!' | '"' | '#' | '$' | '%' | '&'
                |       '*' | '+' | ',' | '-' | '.' | '/'
                |       ':' | ';' | '?' | '@' | '\\' | '^' | '_' | '`' | '~'
                |       '0'..'9' | 'A'..'Z' |'a'..'z' 
                |       '\u0080'..'\u7fff'
                |       '(' | ')'
                |       '\'' | '<' | '>' | '=' | '[' | ']' | '|' 
                ;

SPACE: ' ' | '\t';

It's not clear to me, though, how one would go about outputting HTML. I've been looking into StringTemplate, but I don't understand how to structure my templates, specifically which template goes where in the grammar. Can you help me with a short example?


Solution

  • Okay, after your EDIT, I have a couple of recommendations.

    Like I said in the comments, writing a grammar for such a language is nearly impossible, at least in one go. The only way I see this working is with multiple parsers, where the first parsing stage parses the wiki source very coarsely. For example, a table would be tokenized as TABLE : '{|' .* '|}', and you'd then create another parser that parses this table properly. Doing it all in one parser will result in quite a few ambiguities in your parser rules, IMO.
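
    To make the two-stage idea concrete, here is a minimal sketch in plain Java (no ANTLR involved). A regex stands in for the coarse TABLE token, and a tiny hand-written method stands in for the second, table-only parser. The class and method names (TwoStageSketch, coarseSplit, tableToHtml) are made up for illustration, and the table handling assumes one "| cell" per line with "|-" between rows, which is only a small part of real wiki table syntax:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStageSketch {

    // Stage 1 pattern: grabs a whole table as one chunk, like the coarse TABLE token.
    // The reluctant ".*?" stops at the first "|}", so nested tables are NOT handled.
    private static final Pattern TABLE =
            Pattern.compile("\\{\\|.*?\\|\\}", Pattern.DOTALL);

    /** Stage 1: split the source into whole tables and the text between them. */
    static List<String> coarseSplit(String source) {
        List<String> chunks = new ArrayList<>();
        Matcher m = TABLE.matcher(source);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                chunks.add(source.substring(last, m.start()));
            }
            chunks.add(m.group());
            last = m.end();
        }
        if (last < source.length()) {
            chunks.add(source.substring(last));
        }
        return chunks;
    }

    static boolean isTable(String chunk) {
        return chunk.startsWith("{|") && chunk.endsWith("|}");
    }

    /** Stage 2: a separate pass that only ever sees a single table chunk. */
    static String tableToHtml(String table) {
        // strip the leading "{|" and the trailing "|}"
        String body = table.substring(2, table.length() - 2);
        StringBuilder html = new StringBuilder("<table>");
        for (String row : body.split("\\|-")) {
            StringBuilder cells = new StringBuilder();
            for (String line : row.split("\n")) {
                line = line.trim();
                if (line.startsWith("|")) {
                    cells.append("<td>").append(line.substring(1).trim()).append("</td>");
                }
            }
            if (cells.length() > 0) {
                html.append("<tr>").append(cells).append("</tr>");
            }
        }
        return html.append("</table>").toString();
    }

    /** Runs both stages; non-table chunks are passed through untouched. */
    static String toHtml(String source) {
        StringBuilder out = new StringBuilder();
        for (String chunk : coarseSplit(source)) {
            out.append(isTable(chunk) ? tableToHtml(chunk) : chunk);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "intro\n{|\n| a\n| b\n|-\n| c\n|}\noutro";
        System.out.println(toHtml(source));
    }
}
```

    Running it prints "intro", then the table as <table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>, then "outro", each on its own line. In a real setup, each stage would be its own ANTLR grammar rather than a regex and a loop, but the division of labour would be the same.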

    About emitting HTML code, the "proper" way to do this is indeed with StringTemplate, but given the fact that you're rather new to ANTLR itself, I'd keep things simple. You could create a StringBuilder attribute in your parser class that would collect all your HTML code as you parse your source file. You can embed code in ANTLR rules by wrapping it with { and }.

    Here's a quick demo:

    grammar T;
    
    @parser::members {
    
      // an attribute that is only available in your 
      // parser (so only in parser rules!)
      protected StringBuilder htmlBuilder = new StringBuilder();
    }
    
    // Parser rules
    parse
      :  atom+ EOF
      ;
    
    atom
      :  header
      |  Any    {htmlBuilder.append($Any.text);} // append the text from 'Any' token
      ;
    
    header
      :  H3 h3Content H3 {htmlBuilder.append("<h3>" + $h3Content.text + "</h3>");}
      |  H2 h2Content H2 {htmlBuilder.append("<h2>" + $h2Content.text + "</h2>");}
      |  H1 h1Content H1 {htmlBuilder.append("<h1>" + $h1Content.text + "</h1>");}
      ;
    
    h3Content : ~H3*; // match any token except H3, zero or more times
    h2Content : ~H2*; //        "               H2          "
    h1Content : ~H1*; //        "               H1          "
    
    // Lexer rules    
    H3 : '===';
    H2 : '==';
    H1 : '=';
    
    // Fall through rule: if none of the above 
    // lexer rules matched, this one will.
    Any
      :  .
      ;
    

    From that grammar, you generate a parser and lexer:

    java -cp antlr-3.2.jar org.antlr.Tool T.g
    

    and then create a little class to test your parser:

    import org.antlr.runtime.*;
    
    public class Main {
        public static void main(String[] args) throws Exception {
    
            // the source to be parsed
            String source = 
                    "= header 1 =             \n"+
                    "                         \n"+
                    "some text here           \n"+
                    "                         \n"+
                    "=== header level 3 ===   \n"+
                    "                         \n"+
                    "and some more text         ";
    
            ANTLRStringStream in = new ANTLRStringStream(source);
            TLexer lexer = new TLexer(in);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            TParser parser = new TParser(tokens);
    
            // invoke the start-rule in your parser
            parser.parse();
    
            // print the contents of your parser's StringBuilder
            System.out.println(parser.htmlBuilder);
        }
    }
    

    and then compile all your source files:

    javac -cp antlr-3.2.jar *.java
    

    and finally, run your main class:

    // *nix & MacOS
    java -cp .:antlr-3.2.jar Main
    
    // Windows
    java -cp .;antlr-3.2.jar Main
    

    which will print the following to the console:

    <h1> header 1 </h1>             
    
    some text here           
    
    <h3> header level 3 </h3>   
    
    and some more text  
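
    One thing the demo glosses over: the token text is appended to the StringBuilder verbatim, so a "<" or "&" in the wiki source would end up unescaped in the HTML. Below is a minimal sketch of an escaping helper in plain Java. The class name HtmlEscapeSketch and the method name escapeHtml are made up for illustration, and it only covers the four characters shown:

```java
class HtmlEscapeSketch {

    /** Replaces &, <, > and " with their HTML entities; everything else is copied as-is. */
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;");  break;
                case '<': out.append("&lt;");   break;
                case '>': out.append("&gt;");   break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // same pattern as the parser's htmlBuilder: escape the text, not the tags
        StringBuilder htmlBuilder = new StringBuilder();
        htmlBuilder.append("<h1>").append(escapeHtml(" a < b & c ")).append("</h1>");
        System.out.println(htmlBuilder);
    }
}
```

    This prints <h1> a &lt; b &amp; c </h1>. If you put such a method inside the @parser::members block, the actions could presumably call it as escapeHtml($Any.text) instead of appending $Any.text directly.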
    

    But, again, if you are free to choose a different language to parse, I'd do that and forget about parsing this horrible Wiki-thing.

    Anyway, whatever you do: best of luck!