
What is the procedural way to arrange lex and parse files?


I have searched quite a bit and still can't seem to grasp the workflow and file arrangement of the tokens, the lexer, and the parser. I'm using plain C with Visual Studio. I went through countless tutorials, and they seem to skip this part.

So when you write a simple function in source.txt, the lexer, which is in a C file, reads the source file and breaks that function into tokens?

In the Dr. Dobb's book, there is a file that contains a list of already-defined tokens, which is written into a .def file. If the previous question is true, where does this predefined token list fit in?

Once the source file is read by the lexer and split into tokens, how does the parser retrieve the tokenized info the lexer is putting out and then apply the grammar to it?

I've seen a typedef struct used for what seemed like a partial breakdown of the predefined tokens, with some of the lexer inside. So are the tokens in one file, the lexer in one, and the parser in one? Or are the tokens in one file, with the lexer and parser together calling that token file?

I just need clarification and a good point in the right direction; any help is much appreciated.


Solution

  • A simple classic way to organize these tasks is by setting the various parts up as subroutines passing data structures to one another. Forgive my sloppy syntax.

    First you need a definition of a token, produced by the lexer. This is almost always a struct containing an enum to indicate the token type, and a union to carry any value the token type may have:

     typedef struct Token {
         enum  // distinguishes token types
              { EndOfFile, Integer, String, Float, Identifier,
                Semicolon, Colon, Plus, Minus, Times, Divide, LeftParen, RightParen,
                KeywordBegin, KeywordDeclare, KeywordEnd, ...
              } tokentype;
         union {
             long numeric_value;  // holds numeric value of integer-valued token
             char* string_value;  // holds ptr to string body of identifiers or string literals
             float float_value;   // holds numeric value of floating-point valued token
         } tokenvalue;
     } Token;
    

    You'll want to build an abstract syntax tree. For that you'll need a TreeNode type. Like Tokens, these almost always contain an enum to indicate the node type, a union to hold properties specific to that node type, and finally a list of pointers to children:

          typedef struct TreeNode {
              enum // distinguishes tree node types
                 { Goal, Function, StatementList, Statement, LeftHandSide, Expression,
                   Add, Subtract, Times, Divide, Identifier, FunctionCall, ...
                 } nodetype;
              struct TreeNode* children[4];  // a small constant here is almost always enough
              union // holds values specific to node type, often includes a copy of lexer union
                 { long numeric_value; // holds:
                           // numeric value of integer-valued token
                           // index of built-in function number
                           // actual number of children if it varies
                           // ...
                   char* string_value; // holds ptr to string body of identifiers or string literals
                   float float_value; // holds numeric value of floating-point valued token
                 } nodevalue;
          } TreeNode;
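
    To make the children array concrete, here is a minimal sketch (the helper names make_node, make_identifier, and make_add are invented for illustration) of how such TreeNodes might be allocated and linked to form the AST for "a + b":

        #include <stdlib.h>
        #include <string.h>

        TreeNode* make_node(int nodetype) {
            TreeNode* n = calloc(1, sizeof(TreeNode)); // children start out NULL
            n->nodetype = nodetype;
            return n;
        }

        TreeNode* make_identifier(const char* name) {
            TreeNode* n = make_node(Identifier);
            n->nodevalue.string_value = strdup(name); // copy the lexeme (use _strdup on MSVC)
            return n;
        }

        TreeNode* make_add(TreeNode* left, TreeNode* right) {
            TreeNode* n = make_node(Add);
            n->children[0] = left;   // two-child node: children[0] = left operand
            n->children[1] = right;  //                 children[1] = right operand
            return n;
        }

        /* usage: builds the AST for "a + b" */
        TreeNode* example(void) {
            return make_add(make_identifier("a"), make_identifier("b"));
        }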
    

    MyCompiler.c contains the following code:

         int main() {
                filehandle Handle = OpenSourceFile(&FileName);
                TreeNode* ASTrootnode = Parser(Handle);
                CodeGenerator(ASTrootnode);
                return 0;
         }
    

    Parser.c contains the following code:

         TreeNode* Parser(filehandle Handle) {
                <parsing machinery>
                Token nexttoken = Lexer(Handle);
                <more parsing machinery to build tree nodes>
                return toplevel_TreeNode;
         }
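
    To give the <parsing machinery> some shape, here is a minimal recursive-descent sketch (identifiers invented for illustration; it reuses the make_node helper sketched earlier) of the classic rule Expression ::= Term { '+' Term }:

        static Token lookahead;          // one-token lookahead buffer
        static filehandle SourceHandle;  // a global handle, as in the performance note below

        static void advance(void) { lookahead = Lexer(SourceHandle); }

        TreeNode* ParseTerm(void);       // handles operands; not shown

        TreeNode* ParseExpression(void) {
            TreeNode* left = ParseTerm();
            while (lookahead.tokentype == Plus) {
                advance();                        // consume the '+'
                TreeNode* node = make_node(Add);
                node->children[0] = left;
                node->children[1] = ParseTerm();
                left = node;                      // chains left-associatively
            }
            return left;
        }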
    

    Lexer.c contains the following code:

        Token Lexer(filehandle Handle) {
             Token token;
             <process characters in buffer>
             if (bufferp == end_of_buffer)
                fileread(Handle, &bufferbody, bufferlength);
             <process more characters in buffer>
             token.tokentype = <type of recognized token>;
             token.tokenvalue.numeric_value = <value of numeric token>;
             ...
             return token;
        }
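
    As one concrete instance of <process characters in buffer>, the integer case might look like this self-contained sketch (the name ScanInteger and the cursor parameter are invented for illustration):

        #include <ctype.h>

        /* Scans decimal digits starting at **pp, advances the cursor past them,
           and returns a filled-in Integer token. */
        Token ScanInteger(const char** pp) {
            Token token;
            long value = 0;
            while (isdigit((unsigned char)**pp)) {
                value = value * 10 + (**pp - '0');  // accumulate decimal digits
                (*pp)++;                            // advance the buffer cursor
            }
            token.tokentype = Integer;
            token.tokenvalue.numeric_value = value;
            return token;
        }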
    

    You'll obviously want to put the Token and TreeNode declarations into header files that can be shared across your compiler source files.
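
    For example, a Token.h along these lines (the file name and guard are illustrative; filehandle is the same placeholder type used above) lets Lexer.c, Parser.c, and MyCompiler.c share one definition:

        /* Token.h -- shared token declaration */
        #ifndef TOKEN_H
        #define TOKEN_H

        typedef struct Token {
            enum { EndOfFile, Integer, /* ...the full list from above... */ } tokentype;
            union {
                long  numeric_value;
                char* string_value;
                float float_value;
            } tokenvalue;
        } Token;

        Token Lexer(filehandle Handle);  // the one entry point the parser needs

        #endif /* TOKEN_H */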

    If you build a high-performance compiler, you'll want to optimize these routines. One trivial example: the FileHandle can become a global variable, so it need not be passed between the pieces as an explicit argument. One not-so-trivial example: you'll want to use a high-performance lexer generator, or hand-code the lexer, to maximize its character-processing rate, especially when skipping blanks and comments.
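
    To illustrate that last point, the hot path for skipping blanks is often just a tight pointer scan over the buffer rather than per-character function calls (a sketch; the name is invented):

        /* Advances p past blanks, tabs, and newlines with a raw pointer loop. */
        const char* skip_blanks(const char* p) {
            while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
                p++;  // stays in registers; no call overhead per character
            return p;
        }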

    If you want to see specific details on how to construct a parser that builds ASTs, see my SO answer on building recursive descent parsers: https://stackoverflow.com/a/2336769/120163