
How to parse a list of lines into a single group based on the line prefix using pyparsing


I am trying to parse the output of the command ip netns exec vpn_ns ipsec stroke statusall (example pasted below).

The command prints several lines for each service id of the form oof-#n-#i, where #n is the terminator and #i is the instance using that terminator. For example, oof-2-1 is instance 1 of terminator server oof-2.

How do I declare a match that collects all the lines prefixed by the same id?

From the example I am trying to get to something like this dict:

results = {
    'connections':
        {
            'oof-1-1': [ 3 lines starting with oof-1-1 in section "Connections" ],
            'oof-1-2': [ 3 lines starting with oof-1-2 in section "Connections" ],
            'oof-2-1': [ 3 lines starting with oof-2-1 in section "Connections" ]
        },

    'sec_assocs':
        {
            'oof-1-1': [ 3 lines starting with oof-1-1 in section "Security Associations" ],
            'oof-1-2': [ 3 lines starting with oof-1-2 in section "Security Associations" ],
            'oof-2-1': [ 3 lines starting with oof-2-1 in section "Security Associations" ]
        }
}

Each id maps to a list of the lines that start with it.

This is the full output from the StrongSwan command.

sample = """
Status of IKE charon daemon (strongSwan 5.9.1, Linux 4.15.0-162-generic, x86_64):
  uptime: 25 hours, since Mar 23 15:23:53 2022
  worker threads: 11 of 16 idle, 5/0/0/0 working, job queue: 0/0/0/0, scheduled: 10
  loaded plugins: charon aesni 
Listening IP addresses:
  169.254.123.2
  192.168.51.254
Connections:
     oof-1-1:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-1-1:   remote: [server] uses public key authentication
     oof-1-1:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart
     oof-1-2:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-1-2:   remote: [server] uses public key authentication
     oof-1-2:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart
     oof-2-1:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-2-1:   remote: [server] uses public key authentication
     oof-2-1:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restartd
Security Associations:
     oof-1-1:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-1-1:   remote: [server] uses public key authentication
     oof-1-1:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart
     oof-1-2:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-1-2:   remote: [server] uses public key authentication
     oof-1-2:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart
     oof-2-1:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-2-1:   remote: [server] uses public key authentication
     oof-2-1:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restartd
"""

And this is the reduced sample used in the parsing solution, containing only the "Connections" and "Security Associations" sections:

sample = """
Connections:
     oof-1-1:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-1-1:   remote: [server] uses public key authentication
     oof-1-1:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart
     oof-1-2:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-1-2:   remote: [server] uses public key authentication
     oof-1-2:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart
     oof-2-1:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-2-1:   remote: [server] uses public key authentication
     oof-2-1:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restartd
Security Associations:
     oof-1-1:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-1-1:   remote: [server] uses public key authentication
     oof-1-1:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart
     oof-1-2:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-1-2:   remote: [server] uses public key authentication
     oof-1-2:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart
     oof-2-1:  %any...10.1.0.242  IKEv2, dpddelay=30s
     oof-2-1:   remote: [server] uses public key authentication
     oof-2-1:   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restartd
"""
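The question does not say how the reduced sample was cut from the full output. One possible way, sketched here under the assumption that everything before the first "Connections:" line can simply be dropped, is a small helper. The name from_first_section and its marker parameter are hypothetical and not part of the original post.

```python
def from_first_section(text, marker="Connections:"):
    """Return text starting at the first line equal to marker, or '' if absent."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == marker:
            # keep the marker line and everything after it
            return "\n".join(lines[i:]) + "\n"
    return ""


# shortened stand-in for the full command output shown above
full_output = """\
Status of IKE charon daemon (strongSwan 5.9.1, Linux 4.15.0-162-generic, x86_64):
  uptime: 25 hours, since Mar 23 15:23:53 2022
Listening IP addresses:
  169.254.123.2
Connections:
     oof-1-1:  %any...10.1.0.242  IKEv2, dpddelay=30s
Security Associations:
     oof-1-1:  %any...10.1.0.242  IKEv2, dpddelay=30s
"""

trimmed = from_first_section(full_output)
print(trimmed)
```

The trimmed text then has the same shape as the reduced sample, so it could be passed to the parser in place of sample.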

Solution

  • Regrouping the parsed lines after they have been matched is the most direct way to handle this kind of data. Here is the BNF for the structure you are trying to parse:

    group ::= label ':' line...
    label ::= word...
    line ::= prefix ':' rest_of_line
    prefix ::= word '-' int '-' int
    

    where word is a Word of alphas, int is a Word of nums, and '...' indicates repetition.

    This translates to pyparsing as:

    import pyparsing as pp
    
    COLON = pp.Suppress(":")
    label = pp.Combine(
                pp.Word(pp.alphas)[1, ...], adjacent=False, joinString=" "
                )
    prefix = pp.Combine(
                pp.Word(pp.alphas) + "-" + pp.Word(pp.nums) + "-" + pp.Word(pp.nums)
                )
    post_prefix = COLON + pp.restOfLine
    line = pp.Group(prefix("prefix") + post_prefix)
    lines = pp.Group(line[...])
    group = pp.Group(label("group_label") + COLON + lines("subgroups"))
    

    Pyparsing can generate a railroad diagram of this parser for you (see the create_diagram call in the full source below):

    [image: parser railroad diagram]

    This parses your text, but to regroup the lines by their prefixes, we can add a parse action that uses itertools.groupby:

    def regroup_lines(t):
        from itertools import groupby
        from operator import itemgetter
    
        ret = pp.ParseResults([])
        parsed_lines = t[0]
        for prefix, subgroup in groupby(parsed_lines, key=itemgetter("prefix")):
            # each line in subgroup has the prefix and the rest of the line after the ':'
            # repackage the multiple lines into a single group that is labeled with 
            # the common prefix, and contains the line contents
            ret.append(pp.ParseResults.from_dict(
                {
                    'prefix': prefix,
                    'lines': [line[1] for line in subgroup],
                }
            ))
        return ret
    
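    # lines("subgroups") made a copy of lines when group was defined above, so
    # attach the parse action and then rebuild group so the copy carries it
    lines.add_parse_action(regroup_lines)
    group = pp.Group(label("group_label") + COLON + lines("subgroups"))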
    

    By using a parse action, the regrouping is done at parse time, so no additional post-parsing processing is needed.
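One detail worth keeping in mind is that itertools.groupby only merges items that are adjacent. That is fine for this output, where each prefix's lines appear together, but it would produce two separate groups if the lines for one prefix were ever interleaved with another. The small sketch below shows the difference, using made-up rows (the names rows, runs and merged are purely illustrative):

```python
from itertools import groupby

# (prefix, text) pairs; the last oof-1-1 row is deliberately out of place
rows = [("oof-1-1", "a"), ("oof-1-1", "b"), ("oof-1-2", "c"), ("oof-1-1", "d")]


def first(row):
    return row[0]


# groupby on the rows as given: oof-1-1 shows up as two separate runs
runs = [(key, [text for _, text in grp]) for key, grp in groupby(rows, key=first)]

# sorting by prefix first (sorted is stable) brings equal prefixes together
merged = [
    (key, [text for _, text in grp])
    for key, grp in groupby(sorted(rows, key=first), key=first)
]

print(runs)
print(merged)
```

If interleaved lines were a real possibility, the same sort could be applied to the parsed lines inside the parse action before grouping.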

    Now we can parse your sample and get the regrouped results:

    results = group[...].parseString(sample)
    

    Here is a short function to print out the parsed groups:

    def print_groups(parsed):
        for group in parsed:
            print(group.group_label)
            for subgroup in group.subgroups:
                print(f"- {subgroup.prefix}")
                for line in subgroup.lines:
                    print(f"  {line!r}")
            print()
    
    print_groups(results)
    

    Which gives:

    Connections
    - oof-1-1
      '  %any...10.1.0.242  IKEv2, dpddelay=30s'
      '   remote: [server] uses public key authentication'
      '   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart'
    - oof-1-2
      '  %any...10.1.0.242  IKEv2, dpddelay=30s'
      '   remote: [server] uses public key authentication'
      '   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart'
    - oof-2-1
      '  %any...10.1.0.242  IKEv2, dpddelay=30s'
      '   remote: [server] uses public key authentication'
      '   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restartd'
    
    Security Associations
    - oof-1-1
      '  %any...10.1.0.242  IKEv2, dpddelay=30s'
      '   remote: [server] uses public key authentication'
      '   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart'
    - oof-1-2
      '  %any...10.1.0.242  IKEv2, dpddelay=30s'
      '   remote: [server] uses public key authentication'
      '   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restart'
    - oof-2-1
      '  %any...10.1.0.242  IKEv2, dpddelay=30s'
      '   remote: [server] uses public key authentication'
      '   child:  dynamic === 0.0.0.0/0 TUNNEL, dpdaction=restartd'
    

    Here is the full source for the working example:

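    import pyparsing as pp
    
    # sample is the reduced text from the question (Connections and Security Associations only)
    
    COLON = pp.Suppress(":")
    # a label is one or more words, rejoined with single spaces
    label = pp.Combine(pp.Word(pp.alphas)[1, ...], adjacent=False, joinString=" ")
    label.setName("label")
    # a prefix looks like oof-1-2
    prefix = pp.Combine(pp.Word(pp.alphas) + "-" + pp.Word(pp.nums) + "-" + pp.Word(pp.nums))
    prefix.setName("prefix")
    post_prefix = COLON + pp.restOfLine
    line = pp.Group(prefix("prefix") + post_prefix)
    lines = pp.Group(line[...])
    
    
    def regroup_lines(t):
        from itertools import groupby
        from operator import itemgetter
    
        # merge consecutive lines that share a prefix into one labeled subgroup
        ret = pp.ParseResults([])
        for prefix, subgroup in groupby(t[0], key=itemgetter("prefix")):
            ret.append(pp.ParseResults.from_dict(
                {
                    'prefix': prefix,
                    'lines': [line[1] for line in subgroup],
                }
            ))
        return ret
    
    
    # attach the parse action before lines("subgroups") copies the expression
    lines.add_parse_action(regroup_lines)
    
    group = pp.Group(label("group_label") + COLON + lines("subgroups"))
    # optional: name the expressions and write the railroad diagram
    # (create_diagram needs the railroad-diagrams and jinja2 packages)
    pp.autoname_elements()
    group.create_diagram("groupby_1.html", show_results_names=True)
    results = group[...].parseString(sample)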
    
    
    def print_groups(parsed):
        for group in parsed:
            print(group.group_label)
            for subgroup in group.subgroups:
                print(f"- {subgroup.prefix}")
                for line in subgroup.lines:
                    print(f"  {line!r}")
            print()
    
    print_groups(results)
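
To get from these results to the nested dict shown in the question, a small conversion step along the following lines might work. This is only a sketch: to_nested_dict and KEY_NAMES are hypothetical names, the mapping of "Security Associations" to 'sec_assocs' is an assumption taken from the question's desired output, and stripping each line is a choice rather than a requirement. The demo uses SimpleNamespace stand-ins with the same attribute names as the parsed groups (group_label, subgroups, prefix, lines) so that it runs without pyparsing installed.

```python
from types import SimpleNamespace

# Hypothetical mapping from section labels to the dict keys used in the question.
KEY_NAMES = {"Connections": "connections", "Security Associations": "sec_assocs"}


def to_nested_dict(parsed):
    """Build {section_key: {prefix: [lines]}} from groups shaped like the parsed results."""
    out = {}
    for group in parsed:
        label = group.group_label
        # fall back to a lowercased, underscored label for unmapped sections
        key = KEY_NAMES.get(label, label.lower().replace(" ", "_"))
        section = out.setdefault(key, {})
        for subgroup in group.subgroups:
            # extend rather than assign, in case a prefix appears in two separate runs
            section.setdefault(subgroup.prefix, []).extend(
                line.strip() for line in subgroup.lines
            )
    return out


# Stand-in data mimicking the attribute layout of the parsed groups.
demo = [
    SimpleNamespace(
        group_label="Connections",
        subgroups=[
            SimpleNamespace(
                prefix="oof-1-1",
                lines=["  %any...10.1.0.242  IKEv2", "   remote: [server]"],
            )
        ],
    ),
    SimpleNamespace(
        group_label="Security Associations",
        subgroups=[SimpleNamespace(prefix="oof-2-1", lines=["   child:  dynamic"])],
    ),
]

nested = to_nested_dict(demo)
print(nested)
```

With the parser above, the call would be to_nested_dict(results), since the function reads the same attributes that print_groups does.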