
C# and ANTLR4: Handling "include" directives when parsing a file


I am using ANTLR to parse input files that contain references to other files, much like #include "[insert file name]" in C.

One suggested approach is:

  1. Parse the root file, saving the references as nodes (that is, through dedicated grammar rules)
  2. Visit the tree, searching for "reference" nodes
  3. For each reference node, parse the referenced file and substitute the node with the newly generated tree
  4. Repeat this process recursively to handle multiple levels of inclusion
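
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the node types, the line-based Parse stand-in (used here instead of a real ANTLR parse), and the in-memory files dictionary are all hypothetical. Note that the stand-in accepts any fragment, which a real grammar would not.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node types; a real ANTLR tree would use generated context classes.
abstract class Node { }
sealed class TextNode : Node { public string Text = ""; }
sealed class IncludeNode : Node { public string FileName = ""; }
sealed class FileNode : Node { public List<Node> Children = new List<Node>(); }

static class IncludeExpander
{
    const string Prefix = "#include \"";

    // Step 1: stand-in "parser" that turns each line into a text or reference node.
    public static FileNode Parse(string source)
    {
        var file = new FileNode();
        foreach (var raw in source.Split('\n'))
        {
            var line = raw.Trim();
            bool isInclude = line.StartsWith(Prefix, StringComparison.Ordinal)
                          && line.EndsWith("\"", StringComparison.Ordinal)
                          && line.Length > Prefix.Length + 1;
            if (isInclude)
            {
                string name = line.Substring(Prefix.Length, line.Length - Prefix.Length - 1);
                file.Children.Add(new IncludeNode { FileName = name });
            }
            else
            {
                file.Children.Add(new TextNode { Text = raw });
            }
        }
        return file;
    }

    // Steps 2-4: find reference nodes, parse the referenced file, substitute, recurse.
    public static void Expand(FileNode tree, IDictionary<string, string> files, int depth = 0)
    {
        if (depth > 16)
            throw new InvalidOperationException("Include nesting too deep (cycle?)");
        for (int i = 0; i < tree.Children.Count; i++)
        {
            if (tree.Children[i] is IncludeNode include)
            {
                FileNode subtree = Parse(files[include.FileName]);
                Expand(subtree, files, depth + 1);
                tree.Children[i] = subtree;
            }
        }
    }

    // Collects the text of all leaves in document order.
    public static IEnumerable<string> Flatten(Node node)
    {
        if (node is TextNode text)
        {
            yield return text.Text;
        }
        else if (node is FileNode file)
        {
            foreach (var child in file.Children)
                foreach (var s in Flatten(child))
                    yield return s;
        }
    }
}

static class TreeDemo
{
    static void Main()
    {
        var files = new Dictionary<string, string>
        {
            ["root.txt"] = "start\n#include \"a.txt\"\nend",
            ["a.txt"] = "inside a\n#include \"b.txt\"",
            ["b.txt"] = "inside b",
        };
        FileNode tree = IncludeExpander.Parse(files["root.txt"]);
        IncludeExpander.Expand(tree, files);
        Console.WriteLine(string.Join("\n", IncludeExpander.Flatten(tree)));
    }
}
```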

The problem with this solution is that a referenced file may be only a fragment that is not valid on its own (think of an include inside the body of a C function). To parse such files, I would have to implement a different parser that handles the fragmented grammar.

Is there any valid/suggested approach to (literally) inject the new file inside the ongoing parsing process?


Solution

  • One solution is to override the behavior of the lexer (scanner), specifically its NextToken() method. This is necessary because, to the best of my knowledge, the EOF token cannot be handled in the ANTLR lexer grammar: any action attached to a lexer rule that recognizes EOF is simply ignored (as shown in the code below). The stream switching therefore has to be implemented directly in the lexer's NextToken() method.

    So assume we have a parser grammar

    parser grammar INCParserGrammar;
    
    @parser::members {
            public static Stack<ICharStream> m_nestedfiles = new Stack<ICharStream>();
    }
    
    options { tokenVocab = INCLexerGrammar; }
    
    /*
     * Parser Rules
     */
    
    compileUnit
        :   (include_directives | ANY )+ ENDOFFILE
        ;
    
    include_directives : INCLUDEPREFIX FILE DQUOTE
                         ;
    

    A public static Stack<ICharStream> (m_nestedfiles above) is introduced in the parser grammar's members. This stack stores the character streams of the files that take part in the parsing. Character streams are pushed onto this stack as new files are encountered through include statements

    and a lexer grammar

       lexer grammar INCLexerGrammar;
    
       @lexer::header {
        using System;
        using System.IO;
       }
    
       @lexer::members { 
        string file;
        ICharStream current;
        
       }
    
    
    /*
     * Lexer Rules
     */
    INCLUDEPREFIX : '#include'[ \t]+'"' {                                                 
                                          Mode(INCLexerGrammar.FILEMODE);
                                        };
    
    // ANY cannot start with '#', so it never competes with INCLUDEPREFIX at the same position
    ANY : ~[#]+ ;
    
    
    ENDOFFILE : EOF { /* Any actions in this rule are ignored by the ANTLR lexer */ };
    
    
    ////////////////////////////////////////////////////////////////////////////////////////////////////////
    
    mode FILEMODE;
    FILE : [a-zA-Z][a-zA-Z0-9_]*'.'[a-zA-Z0-9_]+ {  file= Text;
                                                    StreamReader s = new StreamReader(file);
                                                    INCParserGrammar.m_nestedfiles.Push(_input);                                                
                                                    current =new AntlrInputStream(s);           
                                                
                                                 };
    DQUOTE: '"'  {  
                    this._input = current;
                    Mode(INCLexerGrammar.DefaultMode);  };
    

    The overridden NextToken() method is placed in the .g4.cs file, whose purpose is to extend the generated lexer class. This works because the generated lexer class is declared with the "partial" keyword

    Once the partial lexer class associated with the grammar has been generated, open the source code of the Lexer class in the ANTLR4 runtime, copy the whole body of its NextToken() method into the override, and add the following code inside the do-while block, right after the try-catch block:

    if (this._input.La(1) == -1)
    {
        if ( INCParserGrammar.m_nestedfiles.Count == 0 )
            this._hitEOF = true;
        else
            this._input = INCParserGrammar.m_nestedfiles.Pop();
    }
    

    The full body of the NextToken() method override is

    public override IToken NextToken() {
                int marker = this._input != null ? this._input.Mark() : throw new InvalidOperationException("nextToken requires a non-null input stream.");
                label_3:
                try {
                    while (!this._hitEOF) {
                        this._token = (IToken)null;
                        this._channel = 0;
                        this._tokenStartCharIndex = this._input.Index;
                        this._tokenStartCharPositionInLine = this.Interpreter.Column;
                        this._tokenStartLine = this.Interpreter.Line;
                        this._text = (string)null;
                        do {
                            this._type = 0;
                            int num;
                            try {
                                num = this.Interpreter.Match(this._input, this._mode);
                            } catch (LexerNoViableAltException ex) {
                                this.NotifyListeners(ex);
                                this.Recover(ex);
                                num = -3;
                            }
    
                            // La(1) == -1 means the current character stream has reached EOF
                            if (this._input.La(1) == -1) {
                                if (INCParserGrammar.m_nestedfiles.Count == 0 ) {
                                    this._hitEOF = true;
                                }
                                else
                                {
                                    this._input = INCParserGrammar.m_nestedfiles.Pop();
                                }
                            }
    
                            if (this._type == 0)
                                this._type = num;
                            if (this._type == -3)
                                goto label_3;
                        }
                        while (this._type == -2);
                        if (this._token == null)
                            this.Emit();
                        return this._token;
                    }
                    this.EmitEOF();
                    return this._token;
                } finally {
                    this._input.Release(marker);
                }
            }
    
    

    Now, whenever a file name that should be parsed is recognized in the input, attach the following actions. As in the lexer grammar above, FILE saves the current stream and prepares the new one, and DQUOTE performs the actual switch

    FILE
        : [a-zA-Z][a-zA-Z0-9_]*'.'[a-zA-Z0-9_]+ {
            StreamReader s = new StreamReader(Text);
            INCParserGrammar.m_nestedfiles.Push(_input);   // remember the including file
            current = new AntlrInputStream(s);             // switched in only at the closing quote
        };
        
    DQUOTE: '"'  {  this._input = current;
                Mode(INCLexerGrammar.DefaultMode);  };
    // Warning:
    // Be careful when the file name is enclosed in quotes or other symbols, or when the
    // file name is not the last token of the inclusion directive: `_input` should only be
    // switched AFTER the whole directive has been recognized (i.e. after the closing quote
    // has been matched).
    
    

    Finally, the main program is given below. Note that the root file's stream is pushed onto the ICharStream stack first

    static void Main(string[] args) {
        var a = new StreamReader("./root.txt");
        var antlrInput = new AntlrInputStream(a);
        INCParserGrammar.m_nestedfiles.Push(antlrInput);
        var lexer = new INCLexerGrammar(antlrInput);
        var tokens = new BufferedTokenStream(lexer);
        var parser = new INCParserGrammar(tokens);
        parser.compileUnit();
    }
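
To make the push-on-include / pop-on-EOF idea easier to see in isolation, here is a minimal sketch that does not depend on ANTLR at all. It is an assumption-laden illustration, not the lexer override above: the IncludeStreams class, the open callback, and the in-memory files are all hypothetical, and includes are recognized per line rather than per token.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class IncludeStreams
{
    const string Prefix = "#include \"";

    // 'open' is a hypothetical callback that maps a file name to a reader.
    public static string ReadWithIncludes(string rootName, Func<string, TextReader> open)
    {
        var suspended = new Stack<TextReader>();  // readers waiting for an include to finish
        TextReader current = open(rootName);
        var output = new StringBuilder();

        while (true)
        {
            var line = current.ReadLine();
            if (line == null)
            {
                // End of the current source: resume the including file, if any.
                if (suspended.Count == 0) break;
                current = suspended.Pop();
                continue;
            }

            var trimmed = line.Trim();
            bool isInclude = trimmed.StartsWith(Prefix, StringComparison.Ordinal)
                          && trimmed.EndsWith("\"", StringComparison.Ordinal)
                          && trimmed.Length > Prefix.Length + 1;
            if (isInclude)
            {
                string name = trimmed.Substring(Prefix.Length, trimmed.Length - Prefix.Length - 1);
                suspended.Push(current);   // remember where to continue afterwards
                current = open(name);      // keep reading from the included source
            }
            else
            {
                output.Append(line).Append('\n');
            }
        }
        return output.ToString();
    }
}

static class StreamDemo
{
    static void Main()
    {
        // The included file is only a fragment, as in the question.
        var files = new Dictionary<string, string>
        {
            ["root.txt"] = "int main() {\n#include \"body.inc\"\n}",
            ["body.inc"] = "  return 0;",
        };
        Console.Write(IncludeStreams.ReadWithIncludes(
            "root.txt", name => new StringReader(files[name])));
    }
}
```

Running the demo prints the root file with the fragment spliced in place of the include line. Because the switch happens below the parser, the consumer only ever sees one continuous stream, which is why fragments that are not valid on their own pose no problem.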