Parsing a custom configuration format in Python

I'm writing a profile manager for Stellaris game and I've hit a wall with their format in which they keep the info about mods and settings.

Mod file:

name="! (Ship Designer UI Fix) !"
path="mod/ship_designer_ui_fix"
tags={
    "Fixes"
}
remote_file_id="879973318"
supported_version="1.6"

Settings:

language="l_english"
graphics={
    size={
        x=1920
        y=1200
    }
    min_gui={
        x=1920
        y=1200
    }
    gui_scale=1.000000
    gui_safe_ratio=1.000000
    refreshRate=59
    fullScreen=no
    borderless=no
    display_index=0
    shadowSize=2048
    multi_sampling=8
    maxanisotropy=16
    gamma=50.000000
    vsync=yes
}
last_mods={
    "mod/ship_designer_ui_fix.mod"
    "mod/ugc_720237457.mod"
    "mod/ugc_775944333.mod"
}

I've thought pyparsing will be of help there (and it probably will be) but it has been a long time since I've actually did something like this and this I'm clueless atm.

I've got to extract the simple key=value but I'm struggling to actually move from there to be able to extract the arrays, not to mention the multilevel arrays.

lbrack = Literal("{").suppress()
rbrack = Literal("}").suppress()
equals = Literal("=").suppress()

nonequals = "".join([c for c in printables if c != "="]) + " \t"

keydef = ~lbrack + Word(nonequals) + equals + restOfLine

conf = Dict( ZeroOrMore( Group(keydef) ) )
tokens = conf.parseString(data)

I haven't got very far as you can see. Can anyone point me towards next step? I'm not asking a finished and working solution for the whole thing - it would move me forward a lot but where's the fun in that :)

Solution

Well, it is awfully tempting to just dive in and write this parser, but you want some of that fun for yourself, that's great.

Before writing any code, write a BNF. That way you'll write a decent and robust parser, instead of just "everything that's not an equals sign must be an identifier".

There are a lot of "something = something" bits here, look at the kinds of things on the right- and left-hand sides of the '='. The left-hand sides all look like pretty well-mannered identifiers: alphas, underscores. I could envision numeric digits too, as long as they aren't the leading character. So let's say the left-hand sides will be identifiers:

identifier_leading = 'A'..'Z' 'a'..'z' '_'
identifier_body = identifier_leading '0'..'9'
identifier ::= identifier_leading + identifier_body*

The right-hand sides are a mix of things:

integers
floats
'yes' or 'no' booleans
quoted strings
something in braces

The "something in braces" are either a list of quoted strings, or a list of 'identifer = value' pairs. I'll skip the awful details of defining floats and integers and quoted strings, let's just assume we have those defined:

boolean_value ::= 'yes' | 'no'
value ::= float | integer | boolean_value | quoted_string | string_list_in_braces | key_value_list_in_braces
string_list_in_braces ::= '{' quoted_string * '}'
key_value ::= identifier '=' value
key_value_list_in_braces ::= '{' key_value* '}'

You will have to use a pyparsing Forward to declare value before it is fully defined, since it is used in key_value, but key_value is used in key_value_list_in_braces, which is used to define value - a recursive grammar. You are already familiar with the Dict(OneOrMore(Group(named_item))) pattern, and this should be good to give you a structure of fields that are accessible by name. For identifier, a Word would work, or you could just use the pre-defined pyparsing_common.identifier which was introduced as part of the pyparsing_common namespace class last year.

The translation from BNF to pyparsing should be pretty much 1-to-1 from here. For that matter, from the BNF, you could use PLY, ANTLR, or another parsing lib too. The BNF is really worth taking the 1/2 hour or 1/2 day to get sorted out.