Tags: parsing, antlr, grammar, bibtex

BibTeX grammar for ANTLR


I'm looking for a BibTeX grammar for ANTLR to use in a hobby project. I'd rather not spend time writing the grammar myself, since that would involve a learning curve, so I'd appreciate any pointers.

Note: I've found BibTeX grammars for bison and yacc, but couldn't find any for ANTLR.

Edit: As Bart pointed out, I don't need to parse the preambles or the TeX inside the quoted strings.


Solution

  • Here's a (very) rudimentary BibTeX grammar for ANTLR 3 that emits an AST (as opposed to a plain parse tree):

    grammar BibTex;
    
    options {
      output=AST;
      ASTLabelType=CommonTree;
    }
    
    tokens {
      BIBTEXFILE;
      TYPE;
      STRING;
      PREAMBLE;
      COMMENT;
      TAG;
      CONCAT;
    }
    
    //////////////////////////////// Parser rules ////////////////////////////////
    parse
      :  (entry (Comma? entry)* Comma?)? EOF             -> ^(BIBTEXFILE entry*)
      ;
    
    entry
      :  Type Name Comma tags CloseBrace                 -> ^(TYPE Name tags)
      |  StringType Name Assign QuotedContent CloseBrace -> ^(STRING Name QuotedContent)
      |  PreambleType content CloseBrace                 -> ^(PREAMBLE content)
      |  CommentType                                     -> ^(COMMENT CommentType)
      ;
    
    tags
      :  (tag (Comma tag)* Comma?)?                      -> tag*
      ;
    
    tag
      :  Name Assign content                             -> ^(TAG Name content)
      ;
    
    content
      :  concatable (Concat concatable)*                 -> ^(CONCAT concatable+)
      |  Number
      |  BracedContent
      ;
    
    concatable
      :  QuotedContent
      |  Name
      ;
    
    //////////////////////////////// Lexer rules ////////////////////////////////
    Assign
      :  '='
      ;
    
    Concat
      :  '#'
      ;
    
    Comma
      :  ','
      ;
    
    CloseBrace
      :  '}'
      ;
    
    QuotedContent
      :  '"' (~('\\' | '{' | '}' | '"') | '\\' . | BracedContent)* '"'
      ;
    
    BracedContent
      :  '{' (~('\\' | '{' | '}') | '\\' . | BracedContent)* '}'
      ;
    
    StringType
      :  '@' ('s'|'S') ('t'|'T') ('r'|'R') ('i'|'I') ('n'|'N') ('g'|'G') SP? '{'
      ;
    
    PreambleType
      :  '@' ('p'|'P') ('r'|'R') ('e'|'E') ('a'|'A') ('m'|'M') ('b'|'B') ('l'|'L') ('e'|'E') SP? '{'
      ;
    
    CommentType
      :  '@' ('c'|'C') ('o'|'O') ('m'|'M') ('m'|'M') ('e'|'E') ('n'|'N') ('t'|'T') SP? BracedContent
      |  '%' ~('\r' | '\n')*
      ;
    
    Type
      :  '@' Letter+ SP? '{'
      ;
    
    Number
      :  Digit+
      ;
    
    Name
      :  Letter (Letter | Digit | ':' | '-')*
      ;
    
    Spaces
      :  SP {skip();}
      ;
    
    //////////////////////////////// Lexer fragments ////////////////////////////////
    fragment Letter
      :  'a'..'z'
      |  'A'..'Z'
      ;
    
    fragment Digit
      :  '0'..'9'
      ;
    
    fragment SP
      :  (' ' | '\t' | '\r' | '\n' | '\f')+
      ;  
    

    (if you don't want the AST, remove every -> together with everything to the right of it, and remove both the options{...} and tokens{...} blocks)

    which can be tested with the following class:

    import org.antlr.runtime.*;
    import org.antlr.runtime.tree.*;
    import org.antlr.stringtemplate.*;
    
    public class Main {
      public static void main(String[] args) throws Exception {
    
        // parse the file 'test.bib'
        BibTexLexer lexer = new BibTexLexer(new ANTLRFileStream("test.bib"));
        BibTexParser parser = new BibTexParser(new CommonTokenStream(lexer));
    
        // you can use the following tree in your code
        // see: http://www.antlr.org/api/Java/classorg_1_1antlr_1_1runtime_1_1tree_1_1_common_tree.html
        CommonTree tree = (CommonTree)parser.parse().getTree();
    
        // print a DOT tree of our AST
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
      }
    }
    

    and the following example Bib-input (file: test.bib):

    @PREAMBLE{
      "\newcommand{\noopsort}[1]{} "
      # "\newcommand{\singleletter}[1]{#1} " 
    }
    
    @string { 
      me = "Bart Kiers" 
    }
    
    @ComMENt{some comments here}
    
    % or some comments here
    
    @article{mrx05,
      auTHor = me # "Mr. X",
      Title = {Something Great}, 
      publisher = "nob" # "ody",
      YEAR = 2005,
      x = {{Bib}\TeX},
      y = "{Bib}\TeX",
      z = "{Bib}" # "\TeX",
    },
    
    @misc{ patashnik-bibtexing,
           author = "Oren Patashnik",
           title = "BIBTEXing",
           year = "1988"
    } % no comma here
    
    @techreport{presstudy2002,
        author      = "Dr. Diessen, van R. J. and Drs. Steenbergen, J. F.",
        title       = "Long {T}erm {P}reservation {S}tudy of the {DNEP} {P}roject",
        institution = "IBM, National Library of the Netherlands",
        year        = "2002",
        month       = "December",
    }
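
The grammar only records concatenations such as `me # "Mr. X"` as CONCAT nodes; turning them into a final string is left to the code that consumes the AST. Below is a minimal sketch of that step, independent of ANTLR. The class `ConcatResolver` and its methods are hypothetical names made up for this illustration, and two things are assumptions of the sketch: macro names are matched case-insensitively, and an undefined macro is treated as an error.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the grammar or of the ANTLR runtime.
class ConcatResolver {
    // @string macros, keyed by lower-cased name
    // (assumption: names are compared case-insensitively).
    private final Map<String, String> macros = new HashMap<>();

    // Register a macro, e.g. from a STRING node: @string{ me = "Bart Kiers" }
    void define(String name, String value) {
        macros.put(name.toLowerCase(Locale.ROOT), value);
    }

    // Each part is the text of one child of a CONCAT node: either a quoted
    // literal (quotes included, as in a QuotedContent token) or a macro name.
    String resolve(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            boolean quoted = part.length() >= 2
                    && part.startsWith("\"") && part.endsWith("\"");
            if (quoted) {
                // strip the surrounding double quotes
                sb.append(part, 1, part.length() - 1);
            } else {
                String value = macros.get(part.toLowerCase(Locale.ROOT));
                if (value == null) {
                    // assumption of this sketch: fail loudly on unknown macros
                    throw new IllegalArgumentException("undefined @string macro: " + part);
                }
                sb.append(value);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ConcatResolver resolver = new ConcatResolver();
        resolver.define("me", "Bart Kiers");
        System.out.println(resolver.resolve(Arrays.asList("me", "\"Mr. X\"")));
        System.out.println(resolver.resolve(Arrays.asList("\"nob\"", "\"ody\"")));
    }
}
```

With the values from test.bib, the two calls in `main` print `Bart KiersMr. X` and `nobody`. In a real project the parts would come from the children of each CONCAT node rather than from hand-written lists.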
    

    Run the demo

    If you now generate a parser & lexer from the grammar:

    java -cp antlr-3.3.jar org.antlr.Tool BibTex.g
    

    and compile all .java source files:

    javac -cp antlr-3.3.jar *.java
    

    and finally run the Main class:

    *nix/MacOS

    java -cp .:antlr-3.3.jar Main
    

    Windows

    java -cp .;antlr-3.3.jar Main
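
The only difference between the two commands is the classpath separator (`:` versus `;`). If the demo is ever launched from Java code instead of a shell, that difference can be avoided with `File.pathSeparator`. A small sketch follows; `ClasspathDemo` and `classpath` are hypothetical names used only for this illustration.

```java
import java.io.File;

// Hypothetical helper: builds a classpath string that is valid on the
// platform the JVM is currently running on.
class ClasspathDemo {
    static String classpath(String... entries) {
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // Prints the run command from this answer with the right separator
        System.out.println("java -cp " + classpath(".", "antlr-3.3.jar") + " Main");
    }
}
```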
    

    You'll see some output on your console which corresponds to the following AST:

    [Image: the AST for test.bib, rendered with graphviz-dev.appspot.com]

    To emphasize: I did not properly test the grammar! I wrote it a while back and never really used it in any project.
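
Since the grammar is untested, one way to cross-check individual lexer rules without ANTLR is to mirror them by hand and compare the results. The sketch below does this for the recursive BracedContent rule: a backslash escapes the next character, and nested braces must balance. `BraceMatcher` and `matchBraced` are hypothetical names, and the loop uses a depth counter in place of the rule's recursion, which is assumed here to accept the same inputs.

```java
// Hypothetical hand-written counterpart of the lexer rule
//   BracedContent : '{' (~('\\' | '{' | '}') | '\\' . | BracedContent)* '}'
class BraceMatcher {
    // Returns the index just past the '}' that closes the '{' at 'start',
    // or -1 if there is no '{' at 'start' or the braces never balance.
    static int matchBraced(String s, int start) {
        if (start < 0 || start >= s.length() || s.charAt(start) != '{') {
            return -1;
        }
        int depth = 0;
        int i = start;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                if (i + 1 >= s.length()) {
                    return -1;   // backslash with nothing after it
                }
                i += 2;          // '\\' . : skip the escaped character
                continue;
            }
            if (c == '{') {
                depth++;         // entering a (nested) BracedContent
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return i + 1; // outermost brace closed
                }
            }
            i++;
        }
        return -1;               // ran out of input before the braces balanced
    }

    public static void main(String[] args) {
        String value = "{{Bib}\\TeX}";   // the value of field x in test.bib
        int end = matchBraced(value, 0);
        System.out.println(end == value.length() ? "balanced" : "not balanced");
    }
}
```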