lexical-analysis ragel

How to properly scan for identifiers using Ragel


I'm trying to write a scanner for a C/C++/C#/Java/D-like programming language that I'm designing for personal reasons. For this task I'm using Ragel to generate my scanner. I'm having trouble understanding exactly when a lot of the operators trigger actions, probably because my academics were focused on practical knowledge rather than theory, and a great deal of this non-deterministic/deterministic finite automata business goes right over my head. I find either the documentation to be lacking or my understanding of it to be so. I'm assuming the latter.

In any case, I'm working my way up from the basics. I've identified several keywords and special characters in my first iteration. Now I've run into the issue where all keywords are being scanned as identifiers. I'm using the scanner operator for all of my keywords, as that resolved my issue of the string "returns" being scanned as both the "return" and the "returns" keyword.

How can I properly scan for identifiers? I understand that to make this deterministic, I need to effectively specify that a lexeme can only be an identifier if it matches no other token's pattern. Forgive my lack of knowledge.

Ragel Script:

%%{
    Identifier = (alpha | '_') . (alnum | '_')*;
    action IdentifierAction
    {
        std::cout << "identifier(\"";
        std::cout.write(ts, te - ts);
        std::cout << "\")";
    }
}%%

%%{
    main :=
    |*
        Interface => InterfaceAction;
        Class => ClassAction;
        Property => PropertyAction;
        Function => FunctionAction;
        TypeQualifier => TypeQualifierAction;
        OpenParenthesis => OpenParenthesisAction;
        CloseParenthesis => CloseParenthesisAction;
        OpenBracket => OpenBracketAction;
        CloseBracket => CloseBracketAction;
        OpenBrace => OpenBraceAction;
        CloseBrace => CloseBraceAction;
        Semicolon => SemicolonAction;
        Returns => ReturnsAction;
        Return => ReturnAction;
        Identifier => IdentifierAction;
        space+;
    *|;
}%%

Solution

  • I'm not familiar with Ragel, but I have written some custom parsers and scanners.

    Your question seems to be more about detecting keywords than about detecting generic identifiers.

    You have rules telling Ragel to detect when a section of the code is a number, the "return" keyword, a semicolon, the "returns" keyword, an identifier, and so on. Although it's possible to write a rule for each keyword, I wouldn't recommend it.

    What I have learned from experience is that it's better to scan all keywords as identifiers (assign them a general "identifier" token), and then, somewhere in your C/C++ code, detect which identifiers are keywords.

    In other words, Ragel will detect only identifiers: "myvar", "return", and "returns" will all be marked as identifiers. Later, in the code of your semantic action (C/C++, not Ragel), you check each identifier to see whether it is a keyword of your language. This is usually done by keeping a list of keywords.

    I think it would look something like this:

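    %%{
    Identifier = (alpha | '_') . (alnum | '_')*;
    action IdentifierAction
    {
        // Keyword list (illustrative). The host file needs <algorithm>,
        // <iostream>, <iterator>, and <string>.
        static const std::string Keywords[] =
        {
           "return",
           "returns",
           "if",
           "else"
        };

        // The matched lexeme spans [ts, te).
        const std::string MyIdentifier(ts, te - ts);
        if (std::find(std::begin(Keywords), std::end(Keywords), MyIdentifier) != std::end(Keywords)) {
          std::cout << "keyword(\"";
          std::cout.write(ts, te - ts);
          std::cout << "\")";
        }
        else {
          std::cout << "identifier(\"";
          std::cout.write(ts, te - ts);
          std::cout << "\")";
        }
    }
    }%%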
    

    So there would not be a "Return" or "Returns" rule, just "Identifier".