Search code examples
c#antlrantlr4antlr4cs

Filter ANTLR4 tokens before using them in parser


I'm trying to have a grammar that ignores whatever text is within undefined #if statements, like tokens from this string

#if UNDEFINED
  bla bla
#endif
Real Code here

should be transformed to the one below before being passed to parser

#if UNDEFINED
#endif
Real Code here

The idea I've got is to manually weed out IToken instances after they are produced by CommonTokenStream. Which works fine for me. However I'm trying to figure out what'd be the best way to pass filtered list of IToken instances to parser. I was initially looking at TokenStreamRewriter however this guy doesn't seem to do what I'm looking after.

Note: I can not use parser to do this job because parser wouldn't like illegal undefined content.

Right now I'm considering to create a CommonTokenStream subtype when I'd set tokens field manually and pass this instance to parser. But I'm not sure whether this is the best way.

Edit: Perhaps the sample is misleading, UNDEFINED is just an example, in practice it can be anything, even defined with #define statement.


Solution

  • This is an example, and may not be sufficiently robust for your purposes so I'll try to explain.

    lexer grammar MyLexer;
    
    channels { IGNORED }
    
    WS
        : [ \t\r\n]+
        -> channel(HIDDEN)
        ;
    
    OCTOTHORPE
        : '#'
        ;
    
    IFUNDEFINED
        : OCTOTHORPE [iI] [fF] WS [uU] [nN] [dD] [eE] [fF] [iI] [nN] [eE] [dD]
        -> mode(UNDEFINED_MODE)
        ;
    
    
    mode UNDEFINED_MODE;
    
    INTENTIONALLY_IGNORED
        : .*?
        ->channel(IGNORED)
        ;
    
    ENDIF
        : OCTOTHORPE [eE] [nN] [dD] [iI] [fF]
        ->mode(DEFAULT_MODE)
        ;
    

    Once you are in a lexer mode, only the tokens defined there will be emitted.

    This example does not handle, for example...

    #if UNDEFINED
        blah
    #if BOO
        blah blah
    #endif
        blah blah
    #endif
    

    The IFUNDEFINED rule will match the #if UNDEFINED text and switch to mode(UNDEFINED_MODE). While processing in this mode, the INTENTIONALLY_IGNORED rule will match everything and route it to the IGNORED channel but the nongreedy modifier "?" will allow the ENDIF rule to match the #endif text so a mode(DEFAULT_MODE) command can get us out of the UNDEFINED_MODE.

    Sadly, the example I've given shows that there may be more than one #endif.

    If you may have this situation, you must do something a bit more complicated...

    lexer grammar MyLexer;
    
    channels { IGNORED }
    
    WS
        : [ \t\r\n]+
        -> channel(HIDDEN)
        ;
    
    OCTOTHORPE
        : '#'
        ;
    
    IFUNDEFINED
        : OCTOTHORPE [iI] [fF] WS [uU] [nN] [dD] [eE] [fF] [iI] [nN] [eE] [dD]
        -> pushMode(UNDEFINED_MODE)
        ;
    
    
    mode UNDEFINED_MODE;
    
    INTENTIONALLY_IGNORED
        : .*?
        ->channel(IGNORED)
        ;
    
    HASHIF
        : OCTOTHORPE [iI] [fF] WS
        -> pushMode(UNDEFINED_MODE)
        ;
    
    ENDIF
        : OCTOTHORPE [eE] [nN] [dD] [iI] [fF]
        ->popMode
        ;
    

    The pushMode() command switches to the indicated mode and notes this on the mode stack. The popMode command switches to the previous mode and pops the current mode off the mode stack. So long as there are matching #if and #endif text in the source, this will work.

    ----------8<snip----------

    Edited per comment from OP...

    So, now we're going to do something that makes the grammar not portable to other target languages. We're going to embed some code in the lexer. Please pardon my Java.

    Here is MyLexer.g4

    lexer grammar MyLexer;
    
    @lexer::members {
        public java.util.HashSet definedStuff = new java.util.HashSet();
    }
    
    channels {
        IGNORED
    }
    
    WS
        : [ \t]+
        -> channel(HIDDEN)
        ;
    
    NEWLINE
        : [\r\n]
        ->channel(HIDDEN)
        ;
    
    OCTOTHORPE
        : '#'
        ;
    
    HASHDEFINE
        : OCTOTHORPE [dD] [eE] [fF] [iI] [nN] [eE]
        ->pushMode(HASHDEFINE_MODE)
        ;
    
    HASHUNDEF
        : OCTOTHORPE [uU] [nN] [dD] [eE] [fF]
        ->pushMode(HASHUNDEF_MODE)
        ;
    
    HASHIF
        : OCTOTHORPE [iI] [fF]
        -> pushMode(HASHIF_MODE)
        ;
    
    HASHENDIF
        : OCTOTHORPE [eE] [nN] [dD] [iI] [fF]
        ;
    
    REAL_STUFF
        : .+?
        ;
    
    /*
    The rest of your grammar goes here.
    */
    
    mode HASHDEFINE_MODE;
    
    DEFINED_TOKEN
        : ~[ \n\r\t]+ //add other characters as needed
        {
            definedStuff.add(getText());
        }
        ;
    
    HD_WS
        : WS
        -> channel(HIDDEN)
        ;
    
    HD_NEWLINE
        : NEWLINE
        ->channel(HIDDEN),popMode
        ;
    
    mode HASHUNDEF_MODE;
    
    UNDEFINED_TOKEN
        : DEFINED_TOKEN
        {
            definedStuff.remove(getText());
        }
        ;
    
    HU_WS
        : WS
        -> channel(HIDDEN)
        ;
    
    HU_NEWLINE
        : NEWLINE
        ->channel(HIDDEN),popMode
        ;
    
    mode HASHIF_MODE;
    
    HI_WS
        : WS
        -> channel(HIDDEN)
        ;
    
    IF_TOKEN
        : DEFINED_TOKEN
        {
            if (definedStuff.contains(getText())) {
                /*
                In this case DEFINED_TOKEN is in fact defined so we get
                back into the mode we were in before the pushMode that 
                brought us here.
                */
                popMode();
            } else {
                pushMode(IGNORE_MODE);
            }
        }
        ;
    
    mode IGNORE_MODE;
    
    I_HASHIF
        : HASHIF
        -> pushMode(HASHIF_MODE),type(HASHIF)
        ;
    
    I_HASHENDIF
        : HASHENDIF
        ->type(HASHENDIF),popMode,popMode
        ;
    
    INTENTIONALLY_IGNORED
        : .+?
        ->channel(IGNORED)
        ;
    

    Here is MyParser.g4

    parser grammar MyParser;
    
    options {tokenVocab=MyLexer;}
    
    startRule
        : (hashDefine
        | hashUndef
        | hashIfBlock
        | actualCode
        )*
        EOF
        ;
    
    hashDefine
        : HASHDEFINE DEFINED_TOKEN
        ;
    
    hashUndef
        : HASHUNDEF UNDEFINED_TOKEN
        ;
    
    hashIf
        : HASHIF IF_TOKEN
        ;
    
    hashEndif
        : HASHENDIF
        ;
    
    hashIfBlock
        : hashIf
        (INTENTIONALLY_IGNORED+ | actualCode+ | hashIfBlock)*
        hashEndif
        ;
    
    actualCode
        : REAL_STUFF+
        ;
    

    And, here is the test I ran through the generated code.

    #define x
    #if x
    some code here
    #endif
    #if y
    this should be ignored
    #endif
    #if y
    this should also be ignored
    #if x
    this is tricky
    #endif
    #endif
    #if y
    #if z
    also tricky
    #endif
    more trickiness
    #endif
    and finally some actual code again
    

    Which gave what you would expect.

    How does this work? We embed a HashSet named definedStuff in the generated code. When we see #define we switch to HASHDEFINE_MODE and add the token being defined to the definedStuff HashSet. When we see #undef we switch to HASHUNDEF_MODE and remove the token from the definedStuff HashSet.

    When we see a #if we switch to HASHIF_MODE and check to see if the token whose existence is being tested is present in definedStuff. If it is, we pushMode(IGNORE_MODE), if it isn't, we popMode().

    From IGNORE_MODE, if we see a #if we pushMode(HASHIF_MODE) and again see if the token is present in definedStuff.

    We get out of IGNORE_MODE when we see #endif, and we do two popModes because we got there via two pushModes, one for HASHIF_MODE and one for IGNORE_MODE.

    The REAL_STUFF lexer rule and actualCode parser rule are just there to make the test work.