Parse nested references in string


Context

In Python, I have an arbitrary string containing nested references in dot notation, which are later replaced with actual values.

str = 'Allocate {ref:network.node.{ref:global.environment}.api} with {ref:local.value}'

References need to be replaced from the inside out, so ref:global.environment='prod' first, then ref:network.node.prod.api='/prod/api', ref:local.value='UUID', so the result is:

result = 'Allocate /prod/api with UUID'
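
The inside-out order can be illustrated without a parser. This is only a sketch and not the pyparsing approach asked about below: it repeatedly substitutes the innermost references (those with no braces inside) until nothing changes. The flat `values` table and the helper name `resolve` are assumptions for illustration, filled with the values from the example above.

```python
import re

# Hypothetical flat lookup table, using the values from the example above.
values = {
    "global.environment": "prod",
    "network.node.prod.api": "/prod/api",
    "local.value": "UUID",
}

# Matches only innermost references: no braces are allowed inside the path.
INNERMOST = re.compile(r"\{ref:([^{}]+)\}")

def resolve(text):
    """Replace references from the inside out until none are left."""
    while True:
        new_text = INNERMOST.sub(lambda m: str(values[m.group(1)]), text)
        if new_text == text:
            return new_text
        text = new_text

print(resolve("Allocate {ref:network.node.{ref:global.environment}.api} with {ref:local.value}"))
# Allocate /prod/api with UUID
```

The first pass resolves {ref:global.environment} and {ref:local.value}; the second pass resolves the outer reference, which by then contains no nested braces.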

Parsing

I am trying to solve this with pyparsing, since a single regular expression struggles with the nested references. The goal is a list of references that I can process and replace in subsequent steps.

from pyparsing import Forward, Word, ZeroOrMore, alphas, alphanums

lbrack = '{ref:'
rbrack = '}'

ref = Forward()
ref << lbrack + Word(alphas, alphanums + '.') + ZeroOrMore(ref) + rbrack

ref.parseString(str)

A result similar to this one would be helpful:

references = [['{ref:global.environment}'], '{ref:network.node.{ref:global.environment}.api}', '{ref:local.value}']

But I am missing some parsing instructions to get this working. Maybe you have an idea. Thanks for your support.

Update 1 - Solution

Building on PaulMcG's answer, the current code is:

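import operator
from functools import reduce

import pyparsing as pp

# ns is the nested lookup dict defined in the answer below
lbrack = "{ref:"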
rbrack = "}"
ref = pp.Forward()
ident = pp.Word(pp.alphas, pp.alphanums)
ref <<= pp.Group(lbrack + pp.delimitedList(ref | ident, delim=".") + rbrack)

def eval_ref(tokens):
    # Skip lbrack and rbrack, i.e. [1:-1]
    return reduce(operator.getitem, tokens[0][1:-1], ns)

ref.addParseAction(eval_ref)
test = 'Allocate {ref:network.node.{ref:global.environment}.api} with {ref:local.value}'
print(ref.transformString(test))

As a side note, I reduced eval_ref() to a single line using reduce and operator.getitem.
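
To see why the single line works: reduce applies operator.getitem one key at a time along the path, which is the same as chained indexing. A small standalone sketch, where `ns` is a cut-down stand-in for the lookup dict from the answer below (an assumption for this example):

```python
import operator
from functools import reduce

# Cut-down stand-in for the ns lookup dict (an assumption for this sketch).
ns = {"network": {"node": {"prod": {"api": "prod_api"}}}}

path = ["network", "node", "prod", "api"]

# reduce(getitem, path, ns) computes ns["network"]["node"]["prod"]["api"].
value = reduce(operator.getitem, path, ns)
print(value)  # prod_api
```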


Solution

  • You are on the right track, but there is a common, fundamental problem to address first.

    I see this a lot when people define qualified identifiers using:

    Word(alphas, alphanums + ".")
    

    It has some inherent issues, since it will match not only "a" and "a.b.c.d", but also "a.", "a...", "a.c.", and "a..c..0." all as identifiers. In your case, you also want to support an embedded ref in place of a qualifying identifier.
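
    A quick way to check this is to compare the loose definition with one that only allows dots between identifiers. This is a sketch; the names `loose`, `strict` and `accepts` are made up for the comparison.

```python
from pyparsing import ParseException, Word, alphanums, alphas, delimitedList

# Dots allowed anywhere after the first letter.
loose = Word(alphas, alphanums + ".")
# Dots only allowed between identifiers.
strict = delimitedList(Word(alphas, alphanums), delim=".")

def accepts(parser, text):
    """Return True if parser matches the whole of text."""
    try:
        parser.parseString(text, parseAll=True)
        return True
    except ParseException:
        return False

for text in ["a.b.c.d", "a.", "a..c..0."]:
    print(text, accepts(loose, text), accepts(strict, text))
```

    Both accept "a.b.c.d", but only the loose version accepts "a." and "a..c..0.".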

    So instead, think of it this way:

    qualified_ident ::= ident_term ["." ident_term]...
    ident_term      ::= reference | identifier
    reference       ::= "{ref:" qualified_ident "}"
    identifier      ::= [A-Za-z] [A-Za-z0-9]...
    

    Now your qualified ident can be composed of references, which themselves can be composed of qualified idents.

    In pyparsing this looks like the following (using delimitedList with "." as the delimiter for the qualified ident):

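    from pyparsing import Forward, Group, Word, alphas, alphanums, delimitedList

    lbrack, rbrack = "{ref:", "}"
    ref = Forward()
    ident = Word(alphas, alphanums)
    ref <<= Group(lbrack + delimitedList(ref | ident, delim=".") + rbrack)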
    

    Now delimitedList will suppress the "." delimiters, but we don't really care, since we would just have to step over them anyway. We'll write a parse action to do the resolution of the ref to some lookup data.
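
    Before attaching any parse action, it can help to look at the raw tokens this grammar produces. A minimal sketch that repeats the definitions so it runs on its own; the `tokens` variable is just for inspection:

```python
from pyparsing import Forward, Group, Word, alphanums, alphas, delimitedList

lbrack, rbrack = "{ref:", "}"
ref = Forward()
ident = Word(alphas, alphanums)
ref <<= Group(lbrack + delimitedList(ref | ident, delim=".") + rbrack)

# No parse action attached yet: just inspect the nested token structure.
tokens = ref.parseString("{ref:network.node.{ref:global.environment}.api}").asList()
print(tokens)
# [['{ref:', 'network', 'node', ['{ref:', 'global', 'environment', '}'], 'api', '}']]
```

    The "." delimiters are gone, and the inner reference shows up as its own nested group in place of a plain identifier.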

    First off, let's create a simple nested dict from some JSON to support your example string:

    # define a nested dict for lookup values from refs
    import json
    ns = json.loads("""
    {
        "network" : {
            "node" : {
                "prod": {
                    "api": "prod_api"
                },
                "test": {
                    "api": "test_api"
                }
            }
        },
        "local" : {
            "value" : 1000
        },
        "global": {
            "environment" : "prod"
        }
    }
    """)
    

    Now we will write a parse action that will evaluate a reference's path using this namespace dict:

    def eval_ref(tokens):
        ret = ns
    
        # uncomment for debugging
        # print(tokens[0])
    
        # resolve next level down in the reference path
        for t in tokens[0][1:-1]:
            ret = ret[t]
        return ret
    
    # and add as a parse action to ref
    ref.addParseAction(eval_ref)
    

    That should do it. Let's try it on your test string (which I renamed, since str is a builtin type in Python and should not be masked by a variable name). We'll use transformString instead of parseString, though. transformString replaces each matched piece of source text with whatever the parse actions return (or removes it, if the expression is wrapped in Suppress). This happens recursively, so the inner ref is evaluated first, and the outer ref is then evaluated using that resolved inner value.

    test = 'Allocate {ref:network.node.{ref:global.environment}.api} with {ref:local.value}'
    print(ref.transformString(test))
    

    This should give:

    Allocate prod_api with 1000
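
    To confirm the inside-out order, and to get something close to the list of references asked for in the question, one option is to record each path as the parse action sees it. This is a self-contained sketch: the `seen` list and the name `eval_ref_logged` are additions for illustration, and `ns` is the same data as above written as a dict literal.

```python
from pyparsing import Forward, Group, Word, alphanums, alphas, delimitedList

ns = {
    "network": {"node": {"prod": {"api": "prod_api"}, "test": {"api": "test_api"}}},
    "local": {"value": 1000},
    "global": {"environment": "prod"},
}

lbrack, rbrack = "{ref:", "}"
ref = Forward()
ident = Word(alphas, alphanums)
ref <<= Group(lbrack + delimitedList(ref | ident, delim=".") + rbrack)

seen = []  # hypothetical log of resolved paths, in evaluation order

def eval_ref_logged(tokens):
    path = list(tokens[0][1:-1])  # drop "{ref:" and "}"
    seen.append(".".join(path))
    ret = ns
    for key in path:
        ret = ret[key]
    return ret

ref.addParseAction(eval_ref_logged)

test = "Allocate {ref:network.node.{ref:global.environment}.api} with {ref:local.value}"
result = ref.transformString(test)
print(result)
print(seen)
```

    The log should list global.environment first, then network.node.prod.api (with the inner ref already resolved to prod), then local.value.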