Change yylex in C++ Flex


I want to rename yylex to alpha_yylex and have it take a vector as an argument.

.
.
#define YY_DECL int yyFlexLexer::alpha_yylex(std::vector<alpha_token_t> tokens)
%}
.
.
. in main()
std::vector<alpha_token_t> tokens;
while(lexer->alpha_yylex(tokens) != 0) ;

I think I know why this fails: FlexLexer.h obviously has no alpha_yylex. But I don't know how to achieve what I want.

How can I make my own alpha_yylex() or modify the existing one?


Solution

  • It's true that you cannot edit the definition of yyFlexLexer, since FlexLexer.h is effectively a system-wide header file. But you can certainly subclass it, which will provide most of what you need.

    Subclassing yyFlexLexer

    Flex lets you use %option yyclass (or the --yyclass command-line option) to name a subclass, which is then used instead of yyFlexLexer to define yylex. Because you write the subclass yourself, you can include your own header declaring its data members, its constructors, and any additional member functions. In short, if your intention is simply to fill a std::vector<alpha_token_t> with the successive tokens, you can do that by defining AlphaLexer as a subclass of yyFlexLexer with an instance member called tokens (or, perhaps, with accessor functions).

    You can also add additional member functions to your new class, which might provide what you need those additional arguments for.

    What is not quite so straightforward is changing the name and prototype of the scanning function generated by flex, although in the C interface this is easily done with the YY_DECL macro. It can be done in C++ too (see below), but it is not clear that it is actually supported. In any case, it is probably less important in C++.

    Aside from a small wrinkle created by the curious organization of Flex's C++ classes [Note 1], subclassing the lexer class is simple. You need to derive your class from yyFlexLexer [Note 2], which is declared in FlexLexer.h, and you need to tell Flex what the name of your class is, either by using %option yyclass in your Flex file, or by specifying the name on the command line with --yyclass.

    yyFlexLexer includes the various methods for manipulating input buffers, as well as all the mutable state for the lexical scanner used by the standard skeleton. (Much of this is actually derived from the base class FlexLexer.) It also includes a virtual yylex method with prototype

    virtual int yylex();
    

    When you subclass yyFlexLexer, yyFlexLexer::yylex() is defined to signal an error by calling yyFlexLexer::LexerError(const char*) and the generated scanner is defined as the override in the class defined as yyclass. (If you don't subclass, the generated scanner is yyFlexLexer::yylex().)

    The one wrinkle is the way you need to declare your subclass. Normally, you would do that in a header file like this:

    File: myscanner.h (Don't use this version)

    #pragma once
    
    // DON'T DO THIS; IT WON'T WORK (flex 2.6)
    #include <FlexLexer.h>
    
    class MyScanner : public yyFlexLexer {
      // whatever
    };
    

    You would then #include "myscanner.h" in any file which needed to use the scanner, including the generated scanner itself.

    Unfortunately, that won't work because it will result in FlexLexer.h being included twice in the generated scanner; FlexLexer.h does not have an include guard in the normal sense of the word because it is designed to be included multiple times in order to support the prefix option. So you need to define two header files:

    File: myscanner-internal.h

    #pragma once
    // This file depends on FlexLexer.h having already been included
    // in the translation unit. Don't use it other than in the scanner
    // definition.
    class MyScanner : public yyFlexLexer {
      // whatever
    };
    

    File: myscanner.h

    #pragma once
    #include <FlexLexer.h>
    #include "myscanner-internal.h"
    

    Then you use #include "myscanner.h" in every file which needs to know about the scanner except the scanner definition itself. In your myscanner.ll file, you will #include "myscanner-internal.h", which works because Flex has already included FlexLexer.h before it inserts the prologue C++ code from your scanner definition.
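
    As a rough illustration of that arrangement, a scanner definition might look like the sketch below. The rules and the token codes 1 and 2 are hypothetical placeholders, and noyywrap is assumed only to keep the example small; the options c++ and yyclass are the ones discussed above.

```lex
%{
// myscanner.ll (sketch). Per the explanation above, FlexLexer.h has
// already been included when this prologue is copied into the generated
// file, so only the internal header is needed here.
#include "myscanner-internal.h"
%}

%option c++
%option yyclass="MyScanner"
%option noyywrap

%%

[0-9]+       { return 1; /* hypothetical token code for a number */ }
[ \t\r\n]+   { /* skip whitespace */ }
.            { return 2; /* hypothetical token code for anything else */ }

%%
```

    Every other source file would include "myscanner.h" instead, and could then create a MyScanner object and call its yylex() in a loop.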

    Changing the yylex prototype

    You can't really change the prototype (or name) of yylex, because it is declared in FlexLexer.h and, as mentioned above, defined to signal an error. You can, however, redefine YY_DECL to create a new scanner interface. To do so, you must first #undef the existing YY_DECL definition, at least in your scanner definition, because a scanner with %option yyclass="MyScanner" contains #define YY_DECL int MyScanner::yylex(). That would make your myscanner-internal.h file look like this:

    #pragma once
    // This file depends on FlexLexer.h having already been included
    // in the translation unit. Don't use it other than in the scanner
    // definition.
    
    #undef YY_DECL
    #define YY_DECL int MyScanner::alpha_yylex(std::vector<alpha_token_t>& tokens)
    
    #include <vector>
    #include "alpha_token.h"
    
    class MyScanner : public yyFlexLexer {
      public:
        int alpha_yylex(std::vector<alpha_token_t>& tokens);
    
        // whatever else you need
    };
    

    The fact that the MyScanner object still has a (not very functional) yylex method might not be a problem. There are some undocumented interfaces in FlexLexer which call yylex(), but those don't matter if you don't use them. (They're not all that useful, anyway.) But you should at least be aware that the interface exists.

    In any case, I don't see the point of renaming yylex (but perhaps you have a different aesthetic sense). It's already effectively namespaced by being a member of a specific class (MyScanner, above), so yylex doesn't really create any confusion.

    In the particular case of the std::vector<alpha_token_t>& argument, it seems to me that a cleaner solution would be to store the reference as a member variable in the MyScanner class and set it with the constructor or with an accessor method. Unless you actually use different vectors at different points in the lexical analysis -- not evident in the example code in your question -- there's no point burdening every call site with the need to pass the vector into the yylex call. Since lexer actions are compiled inside yylex, which is a member function of MyScanner, instance variables -- even private instance variables -- are usable in the lexer actions. Of course, that's not the only use case for extra yylex arguments, but it's a pretty common one.
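
    A minimal sketch of that design follows. To keep it compilable without flex, StubFlexLexer is a hypothetical stand-in for yyFlexLexer, the body of yylex() is hand-written where flex would normally generate it, and the fields of alpha_token_t are assumed, since the question does not show them. Only the shape of the class is the point: the vector is bound once in the constructor and used from inside yylex().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type; the real alpha_token_t is defined by the asker.
struct alpha_token_t {
    int kind;
    std::string text;
};

// Stand-in for yyFlexLexer so the sketch compiles without flex.
// Only the virtual yylex() shape is modelled.
class StubFlexLexer {
public:
    virtual ~StubFlexLexer() = default;
    virtual int yylex() = 0;
};

class MyScanner : public StubFlexLexer {
public:
    // The vector is bound once, at construction, not at every call.
    explicit MyScanner(std::vector<alpha_token_t>& tokens) : tokens_(tokens) {}

    // With flex, this body is generated from the rules; the actions are
    // compiled inside it and can therefore use tokens_ directly.
    int yylex() override {
        if (next_ >= 3) return 0;   // 0 signals end of input
        ++next_;
        tokens_.push_back({next_, "tok" + std::to_string(next_)});
        return next_;
    }

private:
    std::vector<alpha_token_t>& tokens_;  // private, yet visible to actions
    int next_ = 0;                        // fake input position for the stub
};

// Call site: same loop shape as in the question, with no extra argument.
std::size_t scan_all(std::vector<alpha_token_t>& tokens) {
    MyScanner lexer(tokens);
    while (lexer.yylex() != 0) {}
    return tokens.size();
}
```

    With a real flex scanner, the base class would be yyFlexLexer and the actions in the scanner definition would do the push_back; the call site stays the same.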


    Notes

    1. "The C++ interface is a mess," according to a comment in the generated code.

    2. Using %option prefix, you can change yy to something else if you want to. This is a feature which is supposedly intended to allow you to include multiple lexical scanners in the same project. However, if you're planning on subclassing, the base classes for all these lexical scanners will be identical (other than their names). Thus, there is little or no point having different base classes. Renaming the scanner class using %option prefix is less flexible and no more efficient than subclassing, and it creates an additional header complication. (See this older answer for details.) So I'd recommend sticking with subclassing.
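
    For reference, the multiple-inclusion pattern that FlexLexer.h is designed for is described in the flex manual roughly as shown below. The prefixes xx and zz are placeholders for whatever %option prefix values two scanners might use; this fragment needs flex's header and is not meant to be compiled on its own.

```cpp
// One inclusion of FlexLexer.h per renamed base class.
#undef yyFlexLexer
#define yyFlexLexer xxFlexLexer
#include <FlexLexer.h>

#undef yyFlexLexer
#define yyFlexLexer zzFlexLexer
#include <FlexLexer.h>
```

    This is also why a plain second #include of FlexLexer.h in the generated scanner redefines yyFlexLexer instead of being skipped.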