
Parsing input into a specific output with pyparsing


I'm wondering how to parse the following input:

comment ='  @Class wordinfo dict<<position:int>,wordinfo:str>\n ' + \
                           '@Class instances dict<<word:str>,instances:atomicint> '

To a specific output:

{'wordinfo': {'columns': [('wordinfo', 'text')],
              'primary_keys': [('position', 'int')],
              'type': 'StorageDict'},

 'instances': {'columns': [('instances', 'counter')],
               'primary_keys': [('word', 'text')],
               'type': 'StorageDict'}
}

As shown above, the key of the dictionary should become the primary key, and the one or more values should become the columns. Each entry gives the variable name first, then the variable type. I'm not an expert with pyparsing, so I'm wondering whether there is a basic way to get this result. Is it feasible? What steps would I need to take?


Solution

  • The first step is to write a BNF. You had already started down this path when you wrote: "I need to take the key of the dictionary as a primary key and then I can have one or more values as a columns, first I always have the variable name and then the variable type."

    Convert this to something more formal:

    class_definition :: '@Class' identifier class_body
    class_body :: class_dict  // can add other types here as necessary
    class_dict :: 'dict' '<' '<' identifier ':' value_type '>' ','
                         column_decl [',' column_decl]... '>'
    column_decl :: identifier ':' value_type
    value_type :: 'int' | 'str' | 'atomicint'
    

    Hmmm, identifier ':' value_type appears in a couple of places, so let's call that var_decl and rewrite. I also suspect you may want compound primary keys, written as a comma-separated list inside the <>s. Since that kind of list shows up in two places, it gets its own rule, vars_decl. Rewriting:

    class_definition :: '@Class' identifier class_body
    class_body :: class_dict  // can add other types here as necessary
    class_dict :: 'dict' '<' '<' vars_decl '>' ',' vars_decl '>'
    vars_decl :: var_decl [',' var_decl]...
    var_decl :: identifier ':' value_type
    value_type :: 'int' | 'str' | 'atomicint'
    

    Then work from the bottom-up to define these in pyparsing terms:

    import pyparsing as pp
    S = pp.Suppress
    identifier = pp.pyparsing_common.identifier
    value_type = pp.oneOf("int str atomicint")
    var_decl = pp.Group(identifier + S(":") + value_type)
    vars_decl = pp.Group(pp.delimitedList(var_decl))
    dict_decl = pp.Group(S("dict") + S("<") 
                         + S("<") + vars_decl + S(">") + S(",")
                         + vars_decl 
                         + S(">"))
    class_decl = pp.Group('@Class' + identifier + dict_decl)
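
    Working bottom-up also means each small piece can be checked on its own before the larger expressions are built on top of it. A minimal sketch, assuming pyparsing is installed; the sample strings and the variable names single and several are made up for illustration:

```python
import pyparsing as pp

S = pp.Suppress
identifier = pp.pyparsing_common.identifier
value_type = pp.oneOf("int str atomicint")
var_decl = pp.Group(identifier + S(":") + value_type)
vars_decl = pp.Group(pp.delimitedList(var_decl))

# Smallest piece first: a single name:type pair (the colon is suppressed).
single = var_decl.parseString("position:int").asList()
print(single)

# Then the comma-separated list built on top of it.
several = vars_decl.parseString("word:str, count:atomicint").asList()
print(several)
```

    Each Group adds one level of nesting, which is what later lets the primary key list and the column list stay separate in the results.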
    

    And finally, drop in results names so that you can pick out the different pieces more easily after parsing:

    import pyparsing as pp
    S = pp.Suppress  # match punctuation but keep it out of the results
    identifier = pp.pyparsing_common.identifier
    value_type = pp.oneOf("int str atomicint")
    # one name:type pair, e.g. position:int
    var_decl = pp.Group(identifier("name") + S(":") + value_type("type"))
    # comma-separated list of name:type pairs
    vars_decl = pp.Group(pp.delimitedList(var_decl))
    # dict<<primary key vars>, column vars>
    dict_decl = pp.Group(S("dict") + S("<") 
                         + S("<") + vars_decl("primary_key") + S(">") + S(",")
                         + vars_decl("columns") 
                         + S(">"))
    # @Class <name> <body>
    class_decl = pp.Group('@Class'
                          + identifier("class_name")
                          + dict_decl("class_body"))
    

    Then parse your text using:

    class_definitions = pp.OneOrMore(class_decl).parseString(comment)
    

    And print out what you got:

    print(class_definitions.dump())
    

    Or even better:

    class_decl.runTests(comment)
    

    This is completely untested, so there may be a mismatched paren in there, but that is the general idea. Even if you end up using something other than pyparsing, start with the BNF. It will really help clarify your thinking and your overall understanding of the problem.
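
    The remaining step, turning the named results into the dictionary from the question, could look roughly like the sketch below. It assumes pyparsing is installed. TYPE_MAP and to_storage_dicts are hypothetical names, and the type mapping (str to text, atomicint to counter, int unchanged) is only inferred from the expected output in the question, not from any documented rule.

```python
import pyparsing as pp

S = pp.Suppress
identifier = pp.pyparsing_common.identifier
value_type = pp.oneOf("int str atomicint")
var_decl = pp.Group(identifier("name") + S(":") + value_type("type"))
vars_decl = pp.Group(pp.delimitedList(var_decl))
dict_decl = pp.Group(S("dict") + S("<")
                     + S("<") + vars_decl("primary_key") + S(">") + S(",")
                     + vars_decl("columns")
                     + S(">"))
class_decl = pp.Group('@Class'
                      + identifier("class_name")
                      + dict_decl("class_body"))

# Assumption: mapping read off the question's expected output.
TYPE_MAP = {"int": "int", "str": "text", "atomicint": "counter"}


def to_storage_dicts(text):
    """Parse one or more @Class definitions and build the target dict."""
    result = {}
    for cls in pp.OneOrMore(class_decl).parseString(text):
        body = cls["class_body"]
        result[cls["class_name"]] = {
            "columns": [(v["name"], TYPE_MAP[v["type"]])
                        for v in body["columns"]],
            "primary_keys": [(v["name"], TYPE_MAP[v["type"]])
                             for v in body["primary_key"]],
            "type": "StorageDict",
        }
    return result


comment = ('  @Class wordinfo dict<<position:int>,wordinfo:str>\n '
           '@Class instances dict<<word:str>,instances:atomicint> ')
parsed = to_storage_dicts(comment)
print(parsed)
```

    Because both the primary key and the columns are parsed with vars_decl, the same loop handles compound keys and multiple columns without changes.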