Why is PyParsing taking so much longer to parse vs RegEx? Is it because it creates objects instead of dicts?


Is PyParsing slower than RegEx because it creates objects instead of dicts? If so, can this be improved?

I have an output of almost 400,000 lines that describes 40,000 items in a router table.

I have two parsers, one written with PyParsing and one with RegEx, that do the same task. The difference in performance is around 1:15 to 1:18 in favor of RegEx, and ParserElement.enablePackrat() makes things worse.

  1. I suspect that PyParsing is working "harder" because it generates objects, while RegEx generates dicts.

  2. Did I miss something in the PyParsing grammar that makes it run slower than it should?

  3. Is PyParsing intended for this output scale?


Footnote: I am using the parsers as part of an automation framework. My main goal is to give users and future maintainers code that is easy to understand, use, and maintain. PyParsing allows for that, and I prefer using it. In 99% of cases the number of lines to parse is not this high, so PyParsing is the preferred tool. Following @PaulMcG's reply, I will look into refining the parser.


RegEx

Output

started parse
finished parse, took 0.40858237000065856s, processed 7000 entries
RegEx

Code

import re
import time

output = """(205.1.0.0, 225.1.0.0) SM, Uptime: 00:40:58
Upstream Join/Prune: Joined(HoldTime: 00:00:03), RPF: 2.1.0.1, Flags: KA(00:02:23), RR(00:01:58)
Incoming Interface:
  bundle-201.1,                 Uptime: 00:40:58, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 00:40:36, status: Fwd, JOIN(HoldTime: 00:02:41), Flags:

(205.1.2.139, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
  bundle-201.1                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.140, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.141, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.142, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(*, 225.5.99.0) SM, Uptime: 00:44:41, RP-Address: 100.100.100.100
Upstream Join/Prune: Joined(HoldTime: 00:00:19), RPF: *, Flags:
Incoming Interface:
  lo5,                          Uptime: 00:44:41, status: Rcv, Flags: R
Output Interface List:
  bundle-1215,                  Uptime: 00:42:27, status: Fwd, JOIN(HoldTime: 00:03:14), Flags:
  bundle-1218,                  Uptime: 00:44:41, status: Fwd, JOIN(HoldTime: 00:02:44), Flags:

(205.1.2.142, 225.1.10.1) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

"""
# output = output * 5715
output = output * 1000

sg_line_pattern = re.compile(r'\((?P<source>[\d\.\*]+),'
                             r' (?P<group>[\d\.]+)\)'
                             r' (?P<group_type>\w+),'
                             r' Uptime: (?P<uptime>[\d:\.]+)'
                             r'(, RP-Address: (?P<rp_address>[\d\.]+))?')

join_prune_line_pattern = re.compile(r'Upstream Join/Prune: (?P<upstream_join_prune>.*?),'
                                     r' RPF: (?P<rpf>[\d\.\*]+),'
                                     r' Flags:(?P<flags>.*)')

iif_lines_pattern = re.compile(r'Incoming Interface:(?P<iif_lines>\n(.|\n)*?)Output Interface List:')

oil_lines_pattern = re.compile(r'Output Interface List:(?P<oil_lines>\n(.|\n)*)')

iif_oil_line_pattern = re.compile(r'\s*(?P<interface>[\w\d\-\./]+)(,)?'
                                  r'\s+Uptime: (?P<uptime>[\d:\.]+),'
                                  r' status: (?P<status>\w+),'
                                  r'( (?P<join_prune_state>[\w\s\d():]+),)?'
                                  r'\s+Flags:(?P<flags>.*)')

print("started parse")
start_time = time.monotonic()

output = output.split("\n\n")
output = [entry for entry in output if entry]

result = []

for entry in output:
    iifs = None
    oils = None
    sg_line = re.search(sg_line_pattern, entry).groupdict()
    join_prune_line = re.search(join_prune_line_pattern, entry).groupdict()
    iif_lines = re.search(iif_lines_pattern, entry)
    oil_lines = re.search(oil_lines_pattern, entry)
    if iif_lines:
        iif_lines = iif_lines.groupdict()
        iifs = [m.groupdict() for m in re.finditer(iif_oil_line_pattern, iif_lines['iif_lines'])]
        iifs = {entry["interface"]: entry for entry in iifs}
    if oil_lines:
        oil_lines = oil_lines.groupdict()
        oils = [m.groupdict() for m in re.finditer(iif_oil_line_pattern, oil_lines['oil_lines'])]
        oils = {entry["interface"]: entry for entry in oils}

    entry_dict = {**sg_line, **join_prune_line, "iifs": iifs, "oils": oils}
    result.append(entry_dict)

end_time = time.monotonic()

print(f"finished parse, took {end_time - start_time}s, processed {len(result)} entries")
print('RegEx')

PyParsing

Output:

started parse
finished parse, took 1.3630201699997997s, processed 700 entries
PyParsing

Code:

from pyparsing import Word, Keyword, nums, OneOrMore, Optional, Suppress, Literal, alphanums, LineEnd, \
    Group, SkipTo, ParserElement, Dict
import time

output = """(205.1.0.0, 225.1.0.0) SM, Uptime: 00:40:58
Upstream Join/Prune: Joined(HoldTime: 00:00:03), RPF: 2.1.0.1, Flags: KA(00:02:23), RR(00:01:58)
Incoming Interface:
  bundle-201.1,                 Uptime: 00:40:58, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 00:40:36, status: Fwd, JOIN(HoldTime: 00:02:41), Flags:

(205.1.2.139, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
  bundle-201.1                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.140, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.141, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.142, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

(*, 225.5.99.0) SM, Uptime: 00:44:41, RP-Address: 100.100.100.100
Upstream Join/Prune: Joined(HoldTime: 00:00:19), RPF: *, Flags:
Incoming Interface:
  lo5,                          Uptime: 00:44:41, status: Rcv, Flags: R
Output Interface List:
  bundle-1215,                  Uptime: 00:42:27, status: Fwd, JOIN(HoldTime: 00:03:14), Flags:
  bundle-1218,                  Uptime: 00:44:41, status: Fwd, JOIN(HoldTime: 00:02:44), Flags:

(205.1.2.142, 225.1.10.1) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
  bundle-201.1,                 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
  bundle-112,                   Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
  bundle-113.1,                 Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-114,                   Uptime: 02:04:41, status: Fwd, Flags: I
  bundle-201.2,                 Uptime: 02:04:41, status: Fwd, Flags: I

"""
# output = output * 5715
output = output * 1000

# ParserElement.enablePackrat()

ParserElement.setDefaultWhitespaceChars(" \t")

SkipToNL = Suppress(SkipTo(LineEnd()) + LineEnd())
IpAddress = Word(nums + '.')
ParserUptime = Word(nums + ':.')

SgLine = (Suppress(Literal('(')) + (IpAddress | Literal('*'))('source') + Suppress(Literal(',')) +
          IpAddress('group') + Literal(')') +
          Word(alphanums)('group_type') + Literal(',') +
          Keyword('Uptime:') + ParserUptime('group_uptime') +
          Optional(Keyword(', RP-Address:') + IpAddress('rp_address')) +
          SkipToNL)
SgFlagsLine = (Keyword('Upstream Join/Prune:') + SkipTo(",")('upstream_join_prune') + Literal(',') +
               Keyword('RPF:') + (IpAddress | Literal('*'))('rpf') + Literal(',') +
               Keyword('Flags:') + SkipTo(LineEnd())('sg_flags') + LineEnd())
IifStartLine = (Keyword('Incoming Interface:') + SkipToNL)
IifOilLine = (Word(alphanums + r'-./')('interface_name') + Optional(Literal(',')) +
              Keyword('Uptime:') + ParserUptime('uptime') + Literal(',') +
              Keyword('status:') + Word(alphanums) + Literal(',') +
              Optional(Word(alphanums + r'(): ')('join_prune_state') + Literal(',')) +
              Keyword('Flags:') + SkipTo(LineEnd())('interface_flags') + LineEnd())

IifLines = Dict(OneOrMore(Group(IifOilLine)))("iif")
OilLines = Dict(OneOrMore(Group(IifOilLine)))("oil")
OilStartLine = (Literal('Output Interface List:') + SkipToNL)

grammar = OneOrMore(Group((SgLine +
                           SgFlagsLine +
                           IifStartLine +
                           IifLines +
                           OilStartLine +
                           OilLines +
                           Optional(SkipToNL))
                          )
                    )

grammar.setDefaultWhitespaceChars(" \t")
print("started parse")
start_time = time.monotonic()
result = grammar.parseString(output)
end_time = time.monotonic()

print(f"finished parse, took {end_time - start_time}s, processed {len(result)} entries")
print('PyParsing')

Solution

    1. Primarily, pyparsing is slower because it runs in pure Python, whereas Python's regex engine is implemented in C and is inherently faster. In addition, pyparsing's matching logic is spread across many objects, each with its own parse function that nibbles away at the input string; a compiled re pattern does its matching in a single C function call.
    2. I tried redoing a few of your low-level terms using pyparsing's Regex instead of Word(word_chars), though Word uses a regex internally anyway, so there is unlikely to be much gain. I do note that your terms aren't really doing much pattern matching in their parsing: for instance, using Word(nums+":") to parse a time given in the form 00:00:00 is a bit of a cheat, since that term would also match "::::", ":0:0:", and any integer. The same goes for defining IpAddress as any word composed of digits and ".". If the regexes you are comparing against are as tolerant of badly formatted data, then I'm sure they will be fast.
    3. On my machine, pyparsing parses about 1000 elements per second, so about 40 seconds for your list of 40,000 elements. If you are processing that router output once a day, 40 seconds seems fast enough. If you are doing it once a minute, then pyparsing will not be the right tool.
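
To make point 2 concrete, here is a small sketch using only the standard re module. The two "loose" patterns mirror the character classes in the question's regex parser; the two "strict" patterns are my own assumptions for illustration (they assume uptimes always look like HH:MM:SS, as in the sample output, and they do not check that each IP octet is in the range 0-255). The helper name accepts is also hypothetical.

```python
import re

# Tolerant patterns, equivalent to the character classes used in the question.
LOOSE_UPTIME = re.compile(r'[\d:.]+')
LOOSE_IP = re.compile(r'[\d.]+')

# Stricter alternatives (assumptions for illustration, not from the original post).
STRICT_UPTIME = re.compile(r'\d{2}:\d{2}:\d{2}')      # exactly HH:MM:SS
STRICT_IP = re.compile(r'\d{1,3}(?:\.\d{1,3}){3}')    # four dot-separated numbers


def accepts(pattern, text):
    """Return True when the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


# Compare how each pattern treats well-formed and malformed values.
for text in ["00:40:58", "::::", ":0:0:", "12345"]:
    print(f"uptime {text!r}: loose={accepts(LOOSE_UPTIME, text)}, "
          f"strict={accepts(STRICT_UPTIME, text)}")

for text in ["225.1.0.0", "....", "1.2"]:
    print(f"ip {text!r}: loose={accepts(LOOSE_IP, text)}, "
          f"strict={accepts(STRICT_IP, text)}")
```

The loose patterns accept every sample, including the malformed ones, while the strict patterns accept only "00:40:58" and "225.1.0.0". Whether the extra strictness is worth it depends on how much you trust the router output; it is a validation choice, not a performance fix.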