I'm trying to parse a config file into a Python dictionary. I can't change the syntax of the file.
I'm using pyparsing. Here is my code so far.
from pyparsing import (Combine, Dict, Forward, Group, OneOrMore, Suppress,
                       Word, ZeroOrMore, alphas, nums, printables)

def file_parser():
    # Example data
    data = """
    root {
        level_one {
            key = value
            local {
                auth = psk
            }
            remote {
                auth = psk
            }
            children {
                net {
                    local_ts = 1.1.0.0/16
                    updown = /usr/local/test noticethespace
                    esp_proposals = yht123-h7583
                }
            }
            version = 2
            proposals = ydn162-jhf712-h7583
        }
    }
    usr {
        level_one {
            key = value
        }
    }
    """
    integer = Word(nums)
    ipAddress = Combine(integer + "." + integer + "." + integer + "." + integer)
    name = Word(alphas + "_-")
    any_word = Word(printables, excludeChars="{} ")
    EQ, LBRACE, RBRACE = map(Suppress, "={}")
    gram = Forward()
    entry = Group(name + ZeroOrMore(EQ) + gram)
    struct = Dict(LBRACE + OneOrMore(entry) + RBRACE)
    gram << (struct | ipAddress | name | any_word)
    result = Dict(OneOrMore(entry)).parseString(data)
    print(result)
When I run this code, I get the following error:
pyparsing.ParseException: Expected {Dict:({{Suppress:("{") {Group:({W:(ABCD...) [Suppress:("=")]... : ...})}...} Suppress:("}")}) | Combine:({W:(0123...) "." W:(0123...) "." W:(0123...) "." W:(0123...)}) | W:(ABCD...) | W:(0123...)}, found 'c' (at char 191), (line:11, col:13)
Parts of this code were extracted from this answer. I adapted this code to work with my specific format.
Parsing a recursive grammar always takes some extra thinking. One step I always always always encourage parser devs to take is to write a BNF for your syntax before writing any code. Sometimes you can do this based on your own original syntax design, other times you are trying to reconstruct BNF from example text. Either way, writing a BNF puts your brain in a creative zone instead of coding's logical zone, and you think in parsing concepts instead of code.
In your case, you are in the second group, where you are reconstructing a BNF based on sample text. You look at the example and see that there are parts that have names, and that it looks like a nested dict would be a nice target to shoot for. What are the things that have names? Some things are named by a 'name = a-value' kind of arrangement, other things are named by 'name structure-in-braces'. You created code that follows something like this BNF:
name ::= (alpha | "_-")+
integer ::= digit+
ip_address ::= integer '.' integer '.' integer '.' integer
any_word ::= printable+
entry ::= name '='* gram
struct ::= '{' entry+ '}'
gram ::= struct | ip_address | name | any_word+
In your code, you tried to create one entry expression that handles both of these (with ZeroOrMore(EQ)), and that is the kind of optimization that happens when you jump to code too soon. But these are very different, and should be kept separate in your BNF.
(You also have underspecified your IP address, which in your sample code has a trailing "/16".)
There is also the problematic any_word, which does not handle the value consisting of multiple words, and if extended with OneOrMore will probably read too many words and eat the next name.
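The any_word problem can be seen in isolation with a two-line excerpt of the sample text. This is only a small demonstration, assuming the same any_word definition as in the question; the variable names text, single, and greedy are made up for illustration.

```python
from pyparsing import OneOrMore, Word, printables

# Same definition as in the question: any printable except braces and space.
any_word = Word(printables, excludeChars="{} ")

# A multi-word value followed by the next "name = value" line.
text = "/usr/local/test noticethespace\nesp_proposals = yht123-h7583"

# A single any_word stops at the first space, so the value is cut short.
single = any_word.parseString(text).asList()
print(single)  # ['/usr/local/test']

# OneOrMore(any_word) skips newlines like any other whitespace, so it keeps
# going and swallows the next name, the '=' and the next value as well.
greedy = OneOrMore(any_word).parseString(text).asList()
print(greedy)
# ['/usr/local/test', 'noticethespace', 'esp_proposals', '=', 'yht123-h7583']
```

Neither form respects the line boundary, which is why a line-oriented expression is a better fit for these values.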
So let's start over, and think about your named elements. Here are the lines where you have name = something:
auth = psk
auth = psk
local_ts = 1.1.0.0/16
updown = /usr/local/test noticethespace
esp_proposals = yht123-h7583
version = 2
proposals = ydn162-jhf712-h7583
If we want to define an expression as name_value = name + EQ + value, then value is going to be an IP address, an integer, or just whatever else is left on the line. If you find that there are additional types, you'll need to include other types in this value expression, but be sure to put "whatever else is left on the line" last.
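As a rough sketch of that value expression, tried on three of the lines listed above: the optional "/" suffix on the IP address, the strip() parse action (restOfLine keeps leading whitespace), and the names rest_of_the_line, samples, and parsed_samples are assumptions made for this illustration, not part of the original code.

```python
from pyparsing import Combine, Optional, Suppress, Word, alphas, nums, restOfLine

EQ = Suppress("=")
name = Word(alphas, alphas + "_-")
integer = Word(nums)

# IP address with an optional "/16"-style suffix, joined into one token.
ip_address = Combine(
    integer + "." + integer + "." + integer + "." + integer
    + Optional("/" + integer)
)

# Catch-all: everything up to the end of the line, with surrounding
# whitespace stripped. copy() avoids modifying pyparsing's shared restOfLine.
rest_of_the_line = restOfLine.copy().setParseAction(lambda t: t[0].strip())

# The catch-all goes last so the more specific alternatives are tried first.
value = ip_address | integer | rest_of_the_line
name_value = name + EQ + value

samples = [
    "local_ts = 1.1.0.0/16",
    "version = 2",
    "updown = /usr/local/test noticethespace",
]
parsed_samples = [name_value.parseString(line).asList() for line in samples]
for tokens in parsed_samples:
    print(tokens)
```

If rest_of_the_line were listed first, it would match every value and the other alternatives would never be tried.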
For the nested case, we want to have name_struct = name + struct, where struct is a list of name_values or name_structs, enclosed in braces. That's really all that needs to be said for name_struct.
Here is the BNF I constructed from this description:
name ::= alpha + ('_' | '-' | alpha)*
integer ::= digit+
ip_address ::= integer '.' integer '.' integer '.' integer ['/' integer]
value ::= ip_address | integer | rest_of_the_line
name_value ::= name '=' value
name_struct ::= name struct
struct ::= '{' (name_value | name_struct)* '}'
and the overall parser is one or more name_structs.
Following this BNF and translating it into pyparsing expressions, I converted file_parser() to just return the generated parser - including the sample text and parsing and printing it was too much to include in this one method. Instead the code reads:
data = """...sample text..."""
result = file_parser().parseString(data, parseAll=True)
result.pprint()
And prints out:
[['root',
  ['level_one',
   ['key', 'value'],
   ['local', ['auth', 'psk']],
   ['remote', ['auth', 'psk']],
   ['children',
    ['net',
     ['local_ts', '1.1.0.0/16'],
     ['updown', '/usr/local/test noticethespace'],
     ['esp_proposals', 'yht123-h7583']]],
   ['version', '2'],
   ['proposals', 'ydn162-jhf712-h7583']]],
 ['usr', ['level_one', ['key', 'value']]]]
I'm leaving the implementation of file_parser to you based on these suggestions. In other questions on SO, I go ahead and post the actual parser, but I always wonder if I'm doing too much spoon-feeding, and not leaving the learning experience more in the OP's hands. So I'm stopping here, with the assurance that implementing a parser in file_parser() by following the above BNF will produce a working solution.
Some tips:
- Use restOfLine to read everything up to the end of the line (in place of any_word).
- Use Group for name_value and name_struct, and then struct becomes simply struct <<= Dict(LBRACE + (name_value | name_struct)[...] + RBRACE), where I'm using [...] as the new notation for ZeroOrMore.
- The overall parser is then Dict(name_struct[...]).
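For anyone who wants something to compare their own attempt against afterwards, here is one possible sketch that follows the BNF and the tips. It is not necessarily the parser that produced the output shown earlier: the strip() parse action on restOfLine (which otherwise keeps leading whitespace) and the local names are assumptions made for this sketch.

```python
from pyparsing import (Combine, Dict, Forward, Group, Optional, Suppress,
                       Word, alphas, nums, restOfLine)


def file_parser():
    """Build and return a parser for one or more 'name { ... }' blocks."""
    EQ, LBRACE, RBRACE = map(Suppress, "={}")
    name = Word(alphas, alphas + "_-")
    integer = Word(nums)
    ip_address = Combine(
        integer + "." + integer + "." + integer + "." + integer
        + Optional("/" + integer)
    )
    # Assumption: strip whitespace, since restOfLine does not skip it.
    rest_of_the_line = restOfLine.copy().setParseAction(lambda t: t[0].strip())
    value = ip_address | integer | rest_of_the_line

    struct = Forward()  # recursive: a struct may contain name_structs
    name_value = Group(name + EQ + value)
    name_struct = Group(name + struct)
    struct <<= Dict(LBRACE + (name_value | name_struct)[...] + RBRACE)
    return Dict(name_struct[...])


data = """
root {
    level_one {
        key = value
        local {
            auth = psk
        }
        remote {
            auth = psk
        }
        children {
            net {
                local_ts = 1.1.0.0/16
                updown = /usr/local/test noticethespace
                esp_proposals = yht123-h7583
            }
        }
        version = 2
        proposals = ydn162-jhf712-h7583
    }
}
usr {
    level_one {
        key = value
    }
}
"""

result = file_parser().parseString(data, parseAll=True)
result.pprint()

# Because of the Dict wrappers, nested values can also be looked up by name.
print(result["root"]["level_one"]["children"]["net"]["updown"])
```

Keeping name_value and name_struct as separate Group expressions is what lets Dict treat the first token of each group as a key, at every nesting level.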