
ANTLR fuzzy parsing

I'm building a kind of pre-processor in ANTLRv3, which of course only works with fuzzy parsing. At the moment I'm trying to parse include statements and replace them with the corresponding file content. I used this example:

ANTLR: removing clutter

Based on this example, I wrote the following code:

grammar preprocessor;

options {
    language = Java;
}

@lexer::header {
    package antlr_try_1;
}

@parser::header {
    package antlr_try_1;
}

parse
 : (t=. {System.out.print($t.text);})* EOF
 ;

INCLUDE_STAT
 : 'include' (' ' | '\r' | '\t' | '\n')+ ('A'..'Z' | 'a'..'z' | '_' | '-' | '.')+
     {
     setText("Include statement found!");
     }
 ;

ANY
 : . // fall through rule, matches any character
 ;

This grammar only prints the text and replaces each include statement with the string "Include statement found!". The example text to be parsed looks like this:

some random input
some random input
some random input

include some_file.txt

some random input
some random input
some random input

The output looks like this:

C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 1:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 2:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 3:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 7:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 8:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 9:14 mismatched character 'p' expecting 'c'
some random ut
some random ut
some random ut

Include statement found!

some random ut
some random ut
some random ut

As far as I can tell, the lexer is confused by the "in" in the word "input": it assumes it has found the start of an INCLUDE_STAT token.

Is there a better way to do this? I cannot use the filter option, since I need the rest of the text as well as the include statements. I've tried several other things, but couldn't find a proper solution.


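  • You are observing one of ANTLR 3's limitations: once the lexer has committed to INCLUDE_STAT after seeing "in", it does not back out when a later character fails to match, so it reports an error instead of falling through to the catch-all rule. You could use either of these options to correct the immediate problem: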

    1. Upgrade to ANTLR 4, which does not have this limitation.
    2. Include the following syntactic predicate at the beginning of the INCLUDE_STAT rule:

      `('include' (' ' | '\r' | '\t' | '\n')+ ('A'..'Z' | 'a'..'z' | '_' | '-' | '.')+) =>`
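
To make the difference concrete, here is a toy model in plain Java. It is not ANTLR-generated code, and the class and method names are made up for illustration. It assumes that the failing lexer commits to the include rule as soon as it sees "in" and, on a mismatch, drops what it has consumed plus the offending character, which is what the output in the question suggests. The second scanner checks the whole pattern before committing, roughly what the syntactic predicate asks the lexer to do.

```java
// Toy model only: plain Java, not ANTLR-generated code.
class IncludeScanDemo {
    static final String KEYWORD = "include";
    static final String REPLACEMENT = "Include statement found!";

    static boolean isSpace(char c) {
        return c == ' ' || c == '\r' || c == '\t' || c == '\n';
    }

    static boolean isNameChar(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || c == '_' || c == '-' || c == '.';
    }

    // Returns the index just past a complete include statement starting
    // at pos, or -1 if the text at pos is not a complete include statement.
    static int matchInclude(String s, int pos) {
        if (!s.startsWith(KEYWORD, pos)) return -1;
        int i = pos + KEYWORD.length();
        int start = i;
        while (i < s.length() && isSpace(s.charAt(i))) i++;
        if (i == start) return -1;      // needs at least one whitespace char
        start = i;
        while (i < s.length() && isNameChar(s.charAt(i))) i++;
        return i == start ? -1 : i;     // needs at least one name char
    }

    // Assumed failing behaviour: commit on "in" and never back out.
    static String scanCommitEarly(String s) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < s.length()) {
            int end = matchInclude(s, pos);
            if (end >= 0) {
                out.append(REPLACEMENT);
                pos = end;
            } else if (s.startsWith("in", pos)) {
                // Committed to the include rule: see how much of it matched.
                int k = 2;
                while (k < KEYWORD.length() && pos + k < s.length()
                        && s.charAt(pos + k) == KEYWORD.charAt(k)) k++;
                // Drop the matched prefix plus the offending character.
                pos = Math.min(pos + k + 1, s.length());
            } else {
                out.append(s.charAt(pos++));
            }
        }
        return out.toString();
    }

    // Predicate-like behaviour: only commit if the whole pattern matches,
    // otherwise emit the current character unchanged (the catch-all rule).
    static String scanWithPredicate(String s) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < s.length()) {
            int end = matchInclude(s, pos);
            if (end >= 0) {
                out.append(REPLACEMENT);
                pos = end;
            } else {
                out.append(s.charAt(pos++));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "some random input\ninclude some_file.txt\n";
        System.out.print(scanCommitEarly(input));
        System.out.print(scanWithPredicate(input));
    }
}
```

In this model the first scanner turns "some random input" into "some random ut", as in the question, while the second leaves it untouched and still replaces the include line. The real ANTLR 3 lexer's error recovery is more involved than this sketch.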