
Parsing a custom curly-brace-delimited text configuration format with pyparsing


I need to parse a section of a load balancer configuration. It looks simple (at least to a human).

The config consists of several objects, each with its content in curly braces, like so:

ltm rule ssl-header-insert {
    when HTTP_REQUEST {
  HTTP::header insert "X-SSL-Connection" "yes"
}
}
ltm rule some_redirect {
    priority 1

when HTTP_REQUEST {

    if { (not [class match [IP::remote_addr] equals addresses_group ]) }
    {
        HTTP::redirect "http://some.page.example.com"
        TCP::close
        event disable all
    }
}

The contents of each section/object are Tcl code, so there will be nested curly braces. What I want is to parse this into pairs: the object identifier (the name after the ltm rule keywords) and its contents (the Tcl code within the braces), unchanged.

I've looked at some examples and experimented a lot, but I'm having a hard time with it. I did some debugging within pyparsing (which is a bit confusing to me too), and I think I'm somehow failing to detect the closing braces, but I can't figure out why.

What I've come up with so far:

from pyparsing import *
import json

list_sample = """ltm rule ssl-header-insert {
    when HTTP_REQUEST {
  HTTP::header insert "X-SSL-Connection" "yes"
}
}
ltm rule some_redirect {
    priority 1

when HTTP_REQUEST {

    if { (not [class match [IP::remote_addr] equals addresses_group ]) }
    {
        HTTP::redirect "http://some.page.example.com"
        TCP::close
        event disable all
    }
}
}
ltm rule http_header_replace {
    when HTTP_REQUEST {

        HTTP::header replace Host some.host.example.com

}
}"""

ParserElement.defaultWhitespaceChars=(" \t")
NL = LineEnd()
END = StringEnd()

LBRACE, RBRACE = map(Suppress, '{}')
ANY_HEADER = Suppress("ltm rule ") + Word(alphas, alphanums + "_-")
END_MARK = Literal("ltm rule")

CONTENT_LINE = (~ANY_HEADER + (NotAny(RBRACE + FollowedBy(END_MARK)) + ~END + restOfLine) | (~ANY_HEADER + NotAny(RBRACE + FollowedBy(END)) + ~END + restOfLine)) | (~RBRACE + ~END + restOfLine)

ANY_HEADER.setName("HEADER").setDebug()
LBRACE.setName("LBRACE").setDebug()
RBRACE.setName("RBRACE").setDebug()
CONTENT_LINE.setName("LINE").setDebug()

template_defn = ZeroOrMore((ANY_HEADER + LBRACE +
                 Group(ZeroOrMore(CONTENT_LINE)) +
                 RBRACE))
template_defn.ignore(NL)


results = template_defn.parseString(list_sample).asList()

print("Raw print:")
print(results)
print("----------------------------------------------")
print("JSON pretty dump:")
print(json.dumps(results, indent=2))

The debug output shows that some of the matches work, but in the end parsing fails and the result is an empty list. As a side note, the CONTENT_LINE part of my grammar is probably overly complicated, but I haven't found a simpler way to cover it so far.

The next step would be to figure out how to preserve newlines and tabs in the content part, since I need the content unchanged in the output. But it looks like I have to use the ignore() function, which skips newlines, to parse the multiline text in the first place, so that's another challenge.

I'd be grateful if someone could help me find out what the issues are. Or should I take a different approach?


Solution

  • I think nestedExpr('{', '}') will help here. It takes care of the nested '{}'s, and wrapping it in originalTextFor preserves the newlines and spaces of the matched text.

    import pyparsing as pp
    
    LTM, RULE = map(pp.Keyword, "ltm rule".split())
    ident = pp.Word(pp.alphas, pp.alphanums+'-_')
    
    ltm_rule_expr = pp.Group(LTM + RULE 
                             + ident('name') 
                             + pp.originalTextFor(pp.nestedExpr('{', '}'))('body'))
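
To see what each of the two helpers contributes, here is a minimal sketch on a small made-up input (the text variable is illustrative, not taken from the question). It also calls parseWithTabs(), because pyparsing expands tabs to spaces before parsing by default, which matters if the body has to come back byte-for-byte unchanged.

```python
import pyparsing as pp

# Small made-up input with nested braces, a tab, and newlines.
text = "{\n    a {\n\tb c\n    }\n    d\n}"

# nestedExpr on its own returns a nested list of tokens; the layout is lost.
nested = pp.nestedExpr('{', '}')
tokens = nested.parseString(text).asList()
print(tokens)

# Wrapped in originalTextFor, the result is the matched slice of the input.
# parseWithTabs() stops pyparsing from expanding tabs before parsing.
raw = pp.originalTextFor(pp.nestedExpr('{', '}')).parseWithTabs()
body = raw.parseString(text)[0]
print(repr(body))
```

The first print shows the brace structure as nested lists, while the second shows the original text including its newlines and the tab.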
    

    Using your sample string, assigned to sample (after adding the missing trailing '}'):

    for rule, _, _ in ltm_rule_expr.scanString(sample):
        print(rule[0].name, rule[0].body.splitlines()[0:2])
    

    gives

    ssl-header-insert ['{', '    when HTTP_REQUEST {']
    some_redirect ['{', '    priority 1']
    

    dump() is also a good way to list out the contents of a returned ParseResults:

    for rule, _, _ in ltm_rule_expr.scanString(sample):
        print(rule[0].dump())
        print()
    
    ['ltm', 'rule', 'ssl-header-insert', '{\n    when HTTP_REQUEST {\n  HTTP::header insert "X-SSL-Connection" "yes"\n}\n}']
    - body: '{\n    when HTTP_REQUEST {\n  HTTP::header insert "X-SSL-Connection" "yes"\n}\n}'
    - name: 'ssl-header-insert'
    
    ['ltm', 'rule', 'some_redirect', '{\n    priority 1\n\nwhen HTTP_REQUEST {\n\n    if { (not [class match [IP::remote_addr] equals addresses_group ]) }\n    {\n        HTTP::redirect "http://some.page.example.com"\n        TCP::close\n        event disable all\n    }\n}}']
    - body: '{\n    priority 1\n\nwhen HTTP_REQUEST {\n\n    if { (not [class match [IP::remote_addr] equals addresses_group ]) }\n    {\n        HTTP::redirect "http://some.page.example.com"\n        TCP::close\n        event disable all\n    }\n}}'
    - name: 'some_redirect'
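
If the goal is the name/content pairs from the question, the scanString results can be collected into a dict. The following is a sketch under a few assumptions: the sample is a shortened copy of the question's input, the rules dict is a name introduced here for illustration, and the outer braces are dropped by slicing, which may or may not be what you want.

```python
import pyparsing as pp

# Shortened copy of the sample from the question.
sample = '''ltm rule ssl-header-insert {
    when HTTP_REQUEST {
  HTTP::header insert "X-SSL-Connection" "yes"
}
}
ltm rule http_header_replace {
    when HTTP_REQUEST {

        HTTP::header replace Host some.host.example.com

}
}'''

LTM, RULE = map(pp.Keyword, "ltm rule".split())
ident = pp.Word(pp.alphas, pp.alphanums + '-_')
ltm_rule_expr = pp.Group(LTM + RULE
                         + ident('name')
                         + pp.originalTextFor(pp.nestedExpr('{', '}'))('body'))

# Map each rule name to its body; body[1:-1] drops only the outermost braces.
rules = {}
for tokens, _, _ in ltm_rule_expr.scanString(sample):
    rule = tokens[0]
    rules[rule.name] = rule.body[1:-1]

for name, body in rules.items():
    print(name)
    print(body)
```

If the bodies contain tabs that must survive, calling ltm_rule_expr.parseWithTabs() before scanning should keep pyparsing from expanding them.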
    

    Note that I broke 'ltm' and 'rule' up into separate keyword expressions. This guards against the case where a developer has written valid code such as ltm rule blah with more than one space between "ltm" and "rule". This kind of thing happens all the time; you never know where whitespace will crop up.
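
The difference can be checked with a small sketch that compares the single-literal header from the question with the two-keyword header. The matches helper and the test strings are made up for illustration.

```python
import pyparsing as pp

ident = pp.Word(pp.alphas, pp.alphanums + '-_')

# Header as in the question: one literal containing exactly one space.
literal_header = pp.Suppress("ltm rule ") + ident

# Header as in the answer: two keywords; whitespace between them is skipped.
LTM, RULE = map(pp.Keyword, "ltm rule".split())
keyword_header = LTM + RULE + ident


def matches(expr, text):
    """Return True if expr matches at the start of text."""
    try:
        expr.parseString(text)
        return True
    except pp.ParseException:
        return False


for text in ("ltm rule blah", "ltm   rule blah", "ltm\trule blah"):
    print(repr(text), matches(literal_header, text), matches(keyword_header, text))
```

Only the first string satisfies the literal version, while the keyword version accepts all three.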