pyparsing parse c/cpp enums with values as user defined macros


I have a use case where I need to match enums whose values can be user-defined macros.

Example enum

typedef enum
{
  VAL_1 = -1,
  VAL_2 =  0,
  VAL_3 = 0x10,
  VAL_4 = TEST_ENUM_CUSTOM(1,2),
}MyENUM;

I am using the code below. It works as long as no value uses the VAL_4 format, but I need to match that format as well. I am new to pyparsing, so any help is appreciated.

My code:

LBRACE, RBRACE, EQ, COMMA = map(Suppress, "{}=,")

_enum = Suppress("enum")
identifier = Word(alphas, alphanums + "_")
integer = Word("-"+alphanums)  # I have tried adding "_(,)" to this, but it does not match.

enumValue = Group(identifier("name") + Optional(EQ + integer("value")))
enumList = Group(enumValue + ZeroOrMore(COMMA + enumValue) + Optional(COMMA))
enum = _enum + Optional(identifier("enum")) + LBRACE + enumList("names") + RBRACE + Optional(identifier("typedef"))

enum.ignore(cppStyleComment)
enum.ignore(cStyleComment)

Thanks

-Purna


Solution

  • Adding more characters to integer is the wrong way to go. Even this expression:

    integer = Word("-"+alphanums)
    

    isn't super-great, since it would match "---", "xyz", "q--10-", and many other non-integer strings.

    Better to define integer properly. You could do:

    integer = Combine(Optional('-') + Word(nums))
    

    but I've found that for these low-level expressions that occur many places in your parse string, a Regex is best:

    integer = Regex(r"-?\d+") # Regex(r"-?[0-9]+") if you like more readable re's
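
    To make the difference concrete, here is a minimal sketch comparing the original Word-based expression with the Regex version. The names loose_integer, strict_integer and accepts are only illustrative, and parseAll=True is used so that a partial match counts as a failure.

```python
from pyparsing import ParseException, Regex, Word, alphanums

# the expression from the question, and the stricter Regex-based one
loose_integer = Word("-" + alphanums)
strict_integer = Regex(r"-?\d+")


def accepts(expr, text):
    """Return True if expr matches the whole of text (illustrative helper)."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except ParseException:
        return False


samples = ["-1", "0", "---", "xyz", "q--10-"]
loose = {s: accepts(loose_integer, s) for s in samples}
strict = {s: accepts(strict_integer, s) for s in samples}

print(loose)   # every sample is accepted, including the non-integers
print(strict)  # only "-1" and "0" are accepted
```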
    

    Then define a similar expression for hex_integer.

    Then to add macros, we need a recursive expression, to handle the possibility of macros having arguments that are also macros.

    So at this point, we should just stop writing code for a bit, and do some design. In parser development, this design usually looks like a BNF, where you describe your parser in a sort of pseudocode:

    enum_expr ::= "typedef" "enum" [identifier] 
                    "{" 
                        enum_item_list 
                    "}" [identifier] ";"
    
    enum_item_list ::= enum_item ["," enum_item]... [","]
    enum_item ::= identifier ["=" enum_value]
    
    enum_value ::= integer | hex_integer | macro_expression
    macro_expression ::= identifier "(" enum_value ["," enum_value]... ")"
    

    Note the recursion of macro_expression: it is used in defining enum_value, but it includes enum_value as part of its own definition. In pyparsing, we use a Forward to set up this kind of recursion.
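
    As a small, stripped-down sketch of that recursion on its own (the names value and macro_call and the OUTER/INNER macros are made up for illustration), a Forward is declared first, used inside the macro definition, and only then given its own definition:

```python
from pyparsing import (Forward, Group, Regex, Suppress, Word, alphanums,
                       alphas, delimitedList)

LPAR, RPAR = map(Suppress, "()")
identifier = Word(alphas, alphanums + "_")
integer = Regex(r"-?\d+").addParseAction(lambda t: int(t[0]))

# placeholder for an expression that is defined in terms of itself
value = Forward()

# a macro call takes a comma-separated list of values as arguments
macro_call = Group(identifier + LPAR + Group(delimitedList(value)) + RPAR)

# now fill in the placeholder; macro_call already refers to it
value <<= integer | macro_call

nested = value.parseString("OUTER(1, INNER(2, 3))", parseAll=True).asList()
print(nested)  # [['OUTER', [1, ['INNER', [2, 3]]]]]
```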

    See how that BNF is implemented in the code below. I build on some of the items you posted, but the macro expression required some rework. The bottom line is "don't just keep adding characters to integer trying to get something to work."

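    from pyparsing import (Forward, Group, Keyword, Optional, Regex, Suppress,
                           Word, alphanums, alphas, cppStyleComment,
                           delimitedList)

    LBRACE, RBRACE, EQ, COMMA, LPAR, RPAR, SEMI = map(Suppress, "{}=,();")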
    
    _typedef = Keyword("typedef").suppress()
    _enum = Keyword("enum").suppress()
    identifier = Word(alphas + "_", alphanums + "_")  # C identifiers may also start with "_"
    
    # define an enumValue expression that is recursive, so that enumValues
    # that are macros can take parameters that are enumValues
    enumValue = Forward()
    
    # add more types as needed - parse actions on integer and hex_integer
    # do parse-time conversion to int
    integer = Regex(r"-?\d+").addParseAction(lambda t: int(t[0]))
    # or just use the signed_integer expression found in pyparsing_common
    # integer = pyparsing_common.signed_integer
    hex_integer = Regex(r"0x[0-9a-fA-F]+").addParseAction(lambda t: int(t[0], 16))
    
    # a macro defined using enumValue for parameters
    macro_expr = Group(identifier + LPAR + Group(delimitedList(enumValue)) + RPAR)
    
    # use '<<=' operator to attach recursive definition to enumValue
    enumValue <<= hex_integer | integer | macro_expr
    
    # remaining enum expressions
    enumItem = Group(identifier("name") + Optional(EQ + enumValue("value")))
    enumList = Group(delimitedList(enumItem) + Optional(COMMA))
    enum = (_typedef
            + _enum
            + Optional(identifier("enum"))
            + LBRACE
            + enumList("names")
            + RBRACE
            + Optional(identifier("typedef"))
            + SEMI
            )
    
    # this comment style includes cStyleComment too, so no need to
    # ignore both
    enum.ignore(cppStyleComment)
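
    One detail worth noting in the code above is the order of the alternatives in enumValue. The '|' operator takes the first alternative that matches, so if integer came before hex_integer it would match just the leading "0" of "0x10". A short sketch of that behaviour (hex_first and int_first are illustrative names):

```python
from pyparsing import Regex

integer = Regex(r"-?\d+").addParseAction(lambda t: int(t[0]))
hex_integer = Regex(r"0x[0-9a-fA-F]+").addParseAction(lambda t: int(t[0], 16))

# hex_integer first: the whole "0x10" is consumed and converted
hex_first = (hex_integer | integer).parseString("0x10").asList()

# integer first: only the leading "0" is matched, "x10" is left unparsed
int_first = (integer | hex_integer).parseString("0x10").asList()

print(hex_first)  # [16]
print(int_first)  # [0]
```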
    

    Try it out:

    enum.runTests([
        """
        typedef enum
        {
          VAL_1 = -1,
          VAL_2 =  0,
          VAL_3 = 0x10,
          VAL_4 = TEST_ENUM_CUSTOM(1,2)
        }MyENUM;
        """,
        ])
    

    runTests is for testing and debugging your parser during development. Use enum.parseString(some_enum_expression) or enum.searchString(some_c_header_file_text) to get the actual parse results.
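
    A rough sketch of what that can look like with searchString. The header text here is invented for the example, the grammar is repeated so the snippet runs on its own, and it assumes every enum item has an explicit value.

```python
from pyparsing import (Forward, Group, Keyword, Optional, Regex, Suppress,
                       Word, alphanums, alphas, cppStyleComment,
                       delimitedList)

# same grammar as above, condensed
LBRACE, RBRACE, EQ, COMMA, LPAR, RPAR, SEMI = map(Suppress, "{}=,();")
identifier = Word(alphas, alphanums + "_")
enumValue = Forward()
integer = Regex(r"-?\d+").addParseAction(lambda t: int(t[0]))
hex_integer = Regex(r"0x[0-9a-fA-F]+").addParseAction(lambda t: int(t[0], 16))
macro_expr = Group(identifier + LPAR + Group(delimitedList(enumValue)) + RPAR)
enumValue <<= hex_integer | integer | macro_expr
enumItem = Group(identifier("name") + Optional(EQ + enumValue("value")))
enumList = Group(delimitedList(enumItem) + Optional(COMMA))
enum = (Keyword("typedef").suppress() + Keyword("enum").suppress()
        + Optional(identifier("enum")) + LBRACE + enumList("names") + RBRACE
        + Optional(identifier("typedef")) + SEMI)
enum.ignore(cppStyleComment)

# made-up header content, with unrelated code around the enum
header_text = """
// some header
typedef enum
{
  VAL_1 = -1,   /* first */
  VAL_2 =  0,
  VAL_3 = 0x10,
  VAL_4 = TEST_ENUM_CUSTOM(1,2),
} MyENUM;
int unrelated;
"""

# searchString scans the text and returns every place the grammar matches
first = enum.searchString(header_text)[0]
typedef_name = first["typedef"]
# map each item name to its parsed value (assumes each item has a value)
items = {item["name"]: item.asList()[1] for item in first["names"]}

print(typedef_name)
for name, value in items.items():
    print(name, value)
```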

    Using the new railroad diagram feature in the upcoming pyparsing 3.0 release, here is a visual representation of this parser:

    [image: enum_parser railroad diagram]