Parser generator for inline documentation

To have a general-purpose documentation system that can extract inline documentation of multiple languages, a parser for each language is needed. A parser generator (which actually doesn't have to be that complete or efficient) is thus needed.

http://antlr.org/ is a nice parser generator that already has a number of grammars for popular languages. Are there better alternatives i.e. simpler ones that support generating parsers for even more languages out-of-the-box?

Solution

If you're only looking for "partial parsing", then you could use ANTLR's option to partially "lex" a token stream and ignore the rest of the tokens. You can do that by enabling the filter=true in a lexer-grammar. The lexer then tries to match any token you defined in your grammar, and when it can't match one of the tokens, it advances one single character (and ignores it) and then again tries to match one of your token at the next character:

lexer grammar Foo;

options {filter=true;}

StringLiteral
  :  ...
  ;

CharLiteral
  :  ...
  ;

SingleLineComment
  :  ...
  ;

MultiLineComment
  :  ...
  ;

When implemented properly, you can get the MultiLineComments (/* ... */) from a Java file quite easily without being afraid of single line comments and String- or char literals messing things up.

Obviously, your source files need to be valid to be able to properly tokenize a file, otherwise you get strange results!