
Pyparsing: Parse Dictionary-Like Structure into an actual Dictionary


I'm trying to parse a config file into a Python dictionary. I can't change the syntax of the file.

I'm using pyparsing. Here is my code so far.

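from pyparsing import (Combine, Dict, Forward, Group, OneOrMore, Suppress,
                       Word, ZeroOrMore, alphas, nums, printables)

def file_parser():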

    # Example data
    data = """
    root {
        level_one {
            key = value
            local {
                auth = psk
            }
            remote {
                auth = psk
            }
            children {
                net {
                    local_ts  = 1.1.0.0/16 
                    updown = /usr/local/test noticethespace
                    esp_proposals = yht123-h7583
                }
            }
            version = 2
            proposals = ydn162-jhf712-h7583
        }
    }

    usr {
        level_one {
            key = value
        }
    }
    """

    integer = Word(nums)
    ipAddress = Combine(integer + "." + integer + "." + integer + "." + integer)
    name = Word(alphas + "_-")
    any_word = Word(printables, excludeChars="{} ")
    EQ, LBRACE, RBRACE = map(Suppress, "={}")

    gram = Forward()

    entry = Group(name + ZeroOrMore(EQ) + gram)

    struct = Dict(LBRACE + OneOrMore(entry) + RBRACE)

    gram << (struct | ipAddress | name | any_word)

    result = Dict(OneOrMore(entry)).parseString(data)

    print(result)

When I run this code, I get the following error:

pyparsing.ParseException: Expected {Dict:({{Suppress:("{") {Group:({W:(ABCD...) [Suppress:("=")]... : ...})}...} Suppress:("}")}) | Combine:({W:(0123...) "." W:(0123...) "." W:(0123...) "." W:(0123...)}) | W:(ABCD...) | W:(0123...)}, found 'c'  (at char 191), (line:11, col:13)

Parts of this code were extracted from this answer. I adapted it to work with my specific format.


Solution

  • Parsing a recursive grammar always takes some extra thinking. One step I always always always encourage parser devs to take is to write a BNF for your syntax before writing any code. Sometimes you can do this based on your own original syntax design, other times you are trying to reconstruct BNF from example text. Either way, writing a BNF puts your brain in a creative zone instead of coding's logical zone, and you think in parsing concepts instead of code.

    In your case, you are in the second group, where you are reconstructing a BNF based on sample text. You look at the example and see that there are parts that have names, and that it looks like a nested dict would be a nice target to shoot for. What are the things that have names? Some things are named by a 'name = a-value' kind of arrangement, other things are named by 'name structure-in-braces'. You created code that follows something like this BNF:

    name ::= (alpha | '_' | '-')+
    integer ::= digit+
    ip_address ::= integer '.' integer '.' integer '.' integer
    any_word ::= printable+
    entry ::= name '='* gram
    struct ::= '{' entry+ '}'
    gram ::= struct | ip_address | name | any_word
    

    In your code, you tried to create one entry expression that handles both of these (with ZeroOrMore(EQ)), and that is the kind of optimization that happens when you jump to code too soon. But these are very different, and should be kept separate in your BNF.

    (You have also underspecified your IP address, which in your sample data has a trailing "/16".)

    There is also the problematic any_word, which does not handle a value consisting of multiple words. If you extend it with OneOrMore, it will probably read too many words and eat the next name.
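To make the any_word problem concrete, here is a small sketch that reuses the question's name, EQ, and any_word definitions on two lines taken from the sample data. The variable names one_word, greedy, and whole are only for illustration. It compares a single any_word, OneOrMore(any_word), and pyparsing's restOfLine:

```python
from pyparsing import OneOrMore, Suppress, Word, alphas, printables, restOfLine

# Definitions copied from the question.
name = Word(alphas + "_-")
EQ = Suppress("=")
any_word = Word(printables, excludeChars="{} ")

text = "updown = /usr/local/test noticethespace\nesp_proposals = yht123-h7583"

# A single any_word stops at the first space, so the value is cut short.
one_word = (name + EQ + any_word).parseString(text)
print(one_word.asList())

# OneOrMore(any_word) skips newlines as whitespace and runs into the next entry.
greedy = (name + EQ + OneOrMore(any_word)).parseString(text)
print(greedy.asList())

# restOfLine stops at the end of the line; it keeps the leading space,
# so the value is stripped here for display.
whole = (name + EQ + restOfLine).parseString(text)
print([token.strip() for token in whole])
```

The second result swallows esp_proposals, the "=" sign, and its value, which is the "eat the next name" behavior described above.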

    So let's start over, and think about your named elements. Here are the lines where you have name = something:

    auth = psk
    auth = psk
    local_ts  = 1.1.0.0/16 
    updown = /usr/local/test noticethespace
    esp_proposals = yht123-h7583
    version = 2
    proposals = ydn162-jhf712-h7583
    

    If we want to define an expression as name_value = name + EQ + value, then value is going to be an IP address, an integer, or just whatever else is left on the line. If you find that there are additional types, you'll need to include other types in this value expression, but be sure to put "whatever else is left on the line" last.
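As a rough sketch of that value expression (not taken from the answer's own code): the optional "/" + integer suffix on the IP address and the strip() parse action on restOfLine are my assumptions about how to handle the sample data, and rest_of_the_line is just a name borrowed from the description.

```python
from pyparsing import Combine, Optional, Word, nums, restOfLine

integer = Word(nums)

# IP address with an optional "/16"-style suffix (assumed from the sample data).
ip_address = Combine(
    integer + "." + integer + "." + integer + "." + integer
    + Optional("/" + integer)
)

# restOfLine does not skip leading whitespace, so strip the captured text.
rest_of_the_line = restOfLine.copy().setParseAction(lambda t: t[0].strip())

# Most specific alternatives first; the catch-all goes last.
value = ip_address | integer | rest_of_the_line

for text in ["1.1.0.0/16 ", "2", "/usr/local/test noticethespace", "yht123-h7583"]:
    print(repr(text), "->", value.parseString(text).asList())
```

Because the alternatives are tried in order, a value such as 123abc would be read as the integer 123 with abc left over, so any additional value types need to be slotted in ahead of the catch-all with that in mind.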

    For the nested case, we want to have name_struct = name + struct, where struct is a list of name_values or name_structs, enclosed in braces. That's really all that needs to be said for name_struct.

    Here is the BNF I constructed from this description:

    name ::= alpha ('_' | '-' | alpha)*
    integer ::= digit+
    ip_address ::= integer '.' integer '.' integer '.' integer ['/' integer]
    value ::= ip_address | integer | rest_of_the_line
    name_value ::= name '=' value
    name_struct ::= name struct
    struct ::= '{' (name_value | name_struct)* '}'
    

    and the overall parser is one or more name_structs.

    Following this BNF and translating it into pyparsing expressions, I changed file_parser() to just return the generated parser; embedding the sample text, the parsing, and the printing was too much to include in this one function. The calling code now reads:

    data = """...sample text..."""
    result = file_parser().parseString(data, parseAll=True)
    result.pprint()
    

    And prints out:

    [['root',
      ['level_one',
       ['key', 'value'],
       ['local', ['auth', 'psk']],
       ['remote', ['auth', 'psk']],
       ['children',
        ['net',
         ['local_ts', '1.1.0.0/16'],
         ['updown', '/usr/local/test noticethespace'],
         ['esp_proposals', 'yht123-h7583']]],
       ['version', '2'],
       ['proposals', 'ydn162-jhf712-h7583']]],
     ['usr', ['level_one', ['key', 'value']]]]
    

    I'm leaving the implementation of file_parser to you based on these suggestions. In other questions on SO, I go ahead and post the actual parser, but I always wonder if I'm doing too much spoon-feeding, and not leaving the learning experience more in the OP's hands. So I'm stopping here, with the assurance that implementing a parser in file_parser() by following the above BNF will produce a working solution.

    Some tips:

    • use pyparsing's restOfLine to read everything up to the end of the line (in place of any_word)
    • use Group for name_value and name_struct, and then struct becomes simply struct <<= Dict(LBRACE + (name_value | name_struct)[...] + RBRACE), where I'm using [...] as the new notation for ZeroOrMore
    • the final overall parser will be Dict(name_struct[...])
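
For readers who want to check their own attempt, here is one possible sketch of file_parser() built from the BNF and tips above. It is not the answerer's implementation, which was deliberately left as an exercise. The strip() parse action on restOfLine, the optional "/" + integer suffix, and the use of [1, ...] for "one or more" are my assumptions, and the [...] notation needs pyparsing 2.4.1 or later.

```python
from pyparsing import (Combine, Dict, Forward, Group, Optional, Suppress,
                       Word, alphas, nums, restOfLine)

data = """
root {
    level_one {
        key = value
        local {
            auth = psk
        }
        remote {
            auth = psk
        }
        children {
            net {
                local_ts  = 1.1.0.0/16
                updown = /usr/local/test noticethespace
                esp_proposals = yht123-h7583
            }
        }
        version = 2
        proposals = ydn162-jhf712-h7583
    }
}

usr {
    level_one {
        key = value
    }
}
"""


def file_parser():
    EQ, LBRACE, RBRACE = map(Suppress, "={}")

    # name ::= alpha ('_' | '-' | alpha)*
    name = Word(alphas, alphas + "_-")
    # integer ::= digit+
    integer = Word(nums)
    # ip_address ::= integer '.' integer '.' integer '.' integer ['/' integer]
    ip_address = Combine(
        integer + "." + integer + "." + integer + "." + integer
        + Optional("/" + integer)
    )
    # restOfLine keeps leading whitespace, so strip it (an assumption of this sketch).
    rest_of_the_line = restOfLine.copy().setParseAction(lambda t: t[0].strip())
    # value ::= ip_address | integer | rest_of_the_line
    value = ip_address | integer | rest_of_the_line

    struct = Forward()
    # name_value ::= name '=' value
    name_value = Group(name + EQ + value)
    # name_struct ::= name struct
    name_struct = Group(name + struct)
    # struct ::= '{' (name_value | name_struct)* '}'
    struct <<= Dict(LBRACE + (name_value | name_struct)[...] + RBRACE)

    # The overall parser is one or more name_structs.
    return Dict(name_struct[1, ...])


result = file_parser().parseString(data, parseAll=True)
result.pprint()

# Dict adds key-based access on top of the nested list structure.
print(result["root"]["level_one"]["children"]["net"]["updown"])
print(result.asDict()["usr"])
```

With the sample data this should reproduce the nested list shown earlier, and asDict() converts the result into plain nested Python dictionaries. One limitation of using restOfLine: a struct written on a single line, such as local { auth = psk }, would have its closing brace swallowed by the value, so this sketch relies on each value sitting on its own line as in the sample.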