Is PyParsing slower than RegEx because it creates objects instead of dicts? And if so, can this be improved?
I have an output of almost 400,000 lines that describes 40,000 items in a router table.
I have two parsers, one written in PyParsing and one in RegEx, that do the same task. The difference in performance is around 1:15 to 1:18 in favor of RegEx, and ParserElement.enablePackrat() makes things worse.
I suspect that PyParsing is working "harder" because it generates objects, while RegEx generates dicts.
Did I miss something in the PyParsing grammar that makes it run slower than it should?
Is PyParsing intended for this output scale?
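One way to check the suspicion about object creation, rather than guess, is to profile a parse run and see where the time actually goes. The sketch below uses only the standard library (cProfile and pstats). The function parse_lines, the LINE pattern, and the sample text are hypothetical stand-ins, not either of the parsers in this question; the same wrapper could be put around grammar.parseString(output).

```python
import cProfile
import io
import pstats
import re

# Hypothetical stand-in pattern, loosely modelled on one interface line.
LINE = re.compile(r'(?P<interface>[\w\-./]+),? Uptime: (?P<uptime>[\d:.]+)')


def parse_lines(text):
    """Stand-in for either parser: returns one dict per matching line."""
    return [m.groupdict() for m in LINE.finditer(text)]


sample = 'bundle-112, Uptime: 00:40:36, status: Fwd\n' * 10_000

# Profile only the parse call, not the setup above.
profiler = cProfile.Profile()
profiler.enable()
rows = parse_lines(sample)
profiler.disable()

# Collect the five most expensive entries by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('cumulative').print_stats(5)
print(report.getvalue())
```

If result-object construction dominates the report for the PyParsing run, that would support the suspicion; if matching itself dominates, the grammar is the more likely place to look.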
Footnote: I am using the parsers as part of an automation framework. My main goal is to give users and future maintainers code that is easy to understand, use, and maintain. PyParsing allows for that, and I prefer using it. In 99% of cases the number of lines to parse is not this high, so PyParsing is the preferred tool. Following @PaulMcG's reply, I will look into refining the parser.
RegEx
Output
started parse
finished parse, took 0.40858237000065856s, processed 7000 entries
RegEx
Code
import re
import time
output = """(205.1.0.0, 225.1.0.0) SM, Uptime: 00:40:58
Upstream Join/Prune: Joined(HoldTime: 00:00:03), RPF: 2.1.0.1, Flags: KA(00:02:23), RR(00:01:58)
Incoming Interface:
bundle-201.1, Uptime: 00:40:58, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 00:40:36, status: Fwd, JOIN(HoldTime: 00:02:41), Flags:

(205.1.2.139, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
bundle-201.1 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.140, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.141, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.142, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(*, 225.5.99.0) SM, Uptime: 00:44:41, RP-Address: 100.100.100.100
Upstream Join/Prune: Joined(HoldTime: 00:00:19), RPF: *, Flags:
Incoming Interface:
lo5, Uptime: 00:44:41, status: Rcv, Flags: R
Output Interface List:
bundle-1215, Uptime: 00:42:27, status: Fwd, JOIN(HoldTime: 00:03:14), Flags:
bundle-1218, Uptime: 00:44:41, status: Fwd, JOIN(HoldTime: 00:02:44), Flags:

(205.1.2.142, 225.1.10.1) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

"""
# output = output * 5715
output = output * 1000
sg_line_pattern = re.compile(r'\((?P<source>[\d\.\*]+),'
r' (?P<group>[\d\.]+)\)'
r' (?P<group_type>\w+),'
r' Uptime: (?P<uptime>[\d:\.]+)'
r'(, RP-Address: (?P<rp_address>[\d\.]+))?')
join_prune_line_pattern = re.compile(r'Upstream Join/Prune: (?P<upstream_join_prune>.*?),'
r' RPF: (?P<rpf>[\d\.\*]+),'
r' Flags:(?P<flags>.*)')
iif_lines_pattern = re.compile(r'Incoming Interface:(?P<iif_lines>\n(.|\n)*?)Output Interface List:')
oil_lines_pattern = re.compile(r'Output Interface List:(?P<oil_lines>\n(.|\n)*)')
iif_oil_line_pattern = re.compile(r'\s*(?P<interface>[\w\d\-\./]+)(,)?'
r'\s+Uptime: (?P<uptime>[\d:\.]+),'
r' status: (?P<status>\w+),'
r'( (?P<join_prune_state>[\w\s\d():]+),)?'
r'\s+Flags:(?P<flags>.*)')
print(f"started parse")
start_time = time.monotonic()
output = output.split("\n\n")
output = [entry for entry in output if entry]
result = []
for entry in output:
    iifs = None
    oils = None
    sg_line = re.search(sg_line_pattern, entry).groupdict()
    join_prune_line = re.search(join_prune_line_pattern, entry).groupdict()
    iif_lines = re.search(iif_lines_pattern, entry)
    oil_lines = re.search(oil_lines_pattern, entry)
    if iif_lines:
        iif_lines = iif_lines.groupdict()
        iifs = [m.groupdict() for m in re.finditer(iif_oil_line_pattern, iif_lines['iif_lines'])]
        iifs = {entry["interface"]: entry for entry in iifs}
    if oil_lines:
        oil_lines = oil_lines.groupdict()
        oils = [m.groupdict() for m in re.finditer(iif_oil_line_pattern, oil_lines['oil_lines'])]
        oils = {entry["interface"]: entry for entry in oils}
    group = sg_line['group']
    source = sg_line['source']
    entry_dict = {**sg_line, **join_prune_line, "iifs": iifs, "oils": oils}
    result.append(entry_dict)
end_time = time.monotonic()
print(f"finished parse, took {end_time - start_time}s, processed {len(result)} entries")
print('RegEx')
PyParsing
Output:
started parse
finished parse, took 1.3630201699997997s, processed 700 entries
PyParsing
Code:
from pyparsing import Word, Keyword, nums, OneOrMore, Optional, Suppress, Literal, alphanums, LineEnd, \
Group, SkipTo, ParserElement, Dict
import time
output = """(205.1.0.0, 225.1.0.0) SM, Uptime: 00:40:58
Upstream Join/Prune: Joined(HoldTime: 00:00:03), RPF: 2.1.0.1, Flags: KA(00:02:23), RR(00:01:58)
Incoming Interface:
bundle-201.1, Uptime: 00:40:58, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 00:40:36, status: Fwd, JOIN(HoldTime: 00:02:41), Flags:

(205.1.2.139, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
bundle-201.1 Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.140, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.141, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(205.1.2.142, 225.1.10.0) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

(*, 225.5.99.0) SM, Uptime: 00:44:41, RP-Address: 100.100.100.100
Upstream Join/Prune: Joined(HoldTime: 00:00:19), RPF: *, Flags:
Incoming Interface:
lo5, Uptime: 00:44:41, status: Rcv, Flags: R
Output Interface List:
bundle-1215, Uptime: 00:42:27, status: Fwd, JOIN(HoldTime: 00:03:14), Flags:
bundle-1218, Uptime: 00:44:41, status: Fwd, JOIN(HoldTime: 00:02:44), Flags:

(205.1.2.142, 225.1.10.1) SM, Uptime: 02:04:41
Upstream Join/Prune: Joined(HoldTime: 00:00:20), RPF: 2.1.0.1, Flags: KA(00:03:01), RR(00:02:36)
Incoming Interface:
bundle-201.1, Uptime: 02:04:41, status: Rcv, Flags: S
Output Interface List:
bundle-112, Uptime: 02:02:45, status: Fwd, JOIN(HoldTime: 00:03:11), Flags:
bundle-113.1, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-114, Uptime: 02:04:41, status: Fwd, Flags: I
bundle-201.2, Uptime: 02:04:41, status: Fwd, Flags: I

"""
# output = output * 5715
output = output * 1000
# ParserElement.enablePackrat()
ParserElement.setDefaultWhitespaceChars(" \t")
SkipToNL = Suppress(SkipTo(LineEnd()) + LineEnd())
IpAddress = Word(nums + '.')
ParserUptime = Word(nums + ':.')
SgLine = (Suppress(Literal('(')) + (IpAddress | Literal('*'))('source') + Suppress(Literal(',')) +
IpAddress('group') + Literal(')') +
Word(alphanums)('group_type') + Literal(',') +
Keyword('Uptime:') + ParserUptime('group_uptime') +
Optional(Keyword(', RP-Address:') + IpAddress('rp_address')) +
SkipToNL)
SgFlagsLine = (Keyword('Upstream Join/Prune:') + SkipTo(",")('upstream_join_prune') + Literal(',') +
Keyword('RPF:') + (IpAddress | Literal('*'))('rpf') + Literal(',') +
Keyword('Flags:') + SkipTo(LineEnd())('sg_flags') + LineEnd())
IifStartLine = (Keyword('Incoming Interface:') + SkipToNL)
IifOilLine = (Word(alphanums + r'-./')('interface_name') + Optional(Literal(',')) +
Keyword('Uptime:') + ParserUptime('uptime') + Literal(',') +
Keyword('status:') + Word(alphanums) + Literal(',') +
Optional(Word(alphanums + r'(): ')('join_prune_state') + Literal(',')) +
Keyword('Flags:') + SkipTo(LineEnd())('interface_flags') + LineEnd())
IifLines = Dict(OneOrMore(Group(IifOilLine)))("iif")
OilLines = Dict(OneOrMore(Group(IifOilLine)))("oil")
OilStartLine = (Literal('Output Interface List:') + SkipToNL)
grammar = OneOrMore(Group((SgLine +
SgFlagsLine +
IifStartLine +
IifLines +
OilStartLine +
OilLines +
Optional(SkipToNL))
)
)
grammar.setDefaultWhitespaceChars(" \t")
print(f"started parse")
start_time = time.monotonic()
result = grammar.parseString(output)
end_time = time.monotonic()
print(f"finished parse, took {end_time - start_time}s, processed {len(result)} entries")
print('PyParsing')
Regex instead of Word(word_chars) (though Word uses a regex internally anyway, so you are unlikely to gain much). I do note that your terms aren't really doing much pattern matching in their parsing. For instance, using Word(nums+":") to parse a time given in the form 00:00:00 is a bit of a cheat, since that term would also match "::::", ":0:0:", and any integer. The same goes for defining IpAddress as any word composed of digits and ".". If the re's you are comparing against are just as tolerant of badly formatted data, then I'm sure they will be fast.
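To make the tolerance point concrete, here is a small sketch using only the standard re module. The loose character classes mirror the ones in the question ([\d:.]+ for uptime, [\d.]+ for addresses). The stricter patterns and the names STRICT_UPTIME and STRICT_IPV4 are illustrative assumptions, not part of either parser: the uptime pattern assumes the HH:MM:SS form seen in the sample, and the IPv4 pattern only checks the shape (four groups of one to three digits), not that each octet is at most 255.

```python
import re

# Loose patterns, comparable to Word(nums + ':.') and Word(nums + '.').
LOOSE_UPTIME = re.compile(r'[\d:.]+')
LOOSE_IP = re.compile(r'[\d.]+')

# Stricter alternatives (hypothetical, for illustration only).
STRICT_UPTIME = re.compile(r'\d{2}:\d{2}:\d{2}')
STRICT_IPV4 = re.compile(r'\d{1,3}(?:\.\d{1,3}){3}')


def accepts(pattern, text):
    """Return True if the pattern matches the whole string."""
    return pattern.fullmatch(text) is not None


# Compare what each uptime pattern is willing to accept.
for sample in ('00:40:58', '::::', ':0:0:', '12345'):
    print(f'{sample!r:12} loose={accepts(LOOSE_UPTIME, sample)} '
          f'strict={accepts(STRICT_UPTIME, sample)}')

# Same comparison for the address patterns.
for sample in ('2.1.0.1', '....', '1.2'):
    print(f'{sample!r:12} loose={accepts(LOOSE_IP, sample)} '
          f'strict={accepts(STRICT_IPV4, sample)}')
```

The stricter forms do more work per match, so a speed comparison is only fair when both parsers validate to the same degree.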