Exhaustively parse file for all matches


I have a grammar for parsing some log files using pyparsing but am running into an issue where only the first match is being returned. Is there a way to ensure that I get exhaustive matches? Here's some code:

from pyparsing import Literal, Optional, oneOf, OneOrMore, ParserElement, Regex, restOfLine, Suppress, ZeroOrMore

ParserElement.setDefaultWhitespaceChars(' ')
dt = Regex(r'''\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) 20\d\d \d\d:\d\d:\d\d\,\d{3}''')
# TODO maybe add a parse action to make a datetime object out of the dt capture group
log_level = Suppress('[') + oneOf("INFO DEBUG ERROR WARN TRACE") + Suppress(']')
package_name = Regex(r'''(com|org|net)\.(\w+\.)+\w+''')
junk_data = Optional(Regex(r'\(.*?\)'))
guid = Regex('[A-Za-z0-9]{8}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{12}')

first_log_line = dt.setResultsName('datetime') +                    \
            log_level('log_level') +                                \
            guid('guid') +                                          \
            junk_data('junk') +                                     \
            package_name('package_name') +                          \
            Suppress(':') +                                         \
            restOfLine('message') +                                 \
            Suppress('\n')
additional_log_lines = Suppress('\t') + package_name + restOfLine
log_entry = (first_log_line + Optional(ZeroOrMore(additional_log_lines)))
log_batch = OneOrMore(log_entry)

In my mind, the last two lines are sort of equivalent to

log_entry := first_log_line | first_log_line additional_log_lines
additional_log_lines := additional_log_line | additional_log_line additional_log_lines
log_batch := log_entry | log_entry log_batch

Or something of the sort. Am I thinking about this wrong? I only see a single match with all of the expected tokens when I do print(log_batch.parseString(data).dump()).


Solution

  • Your scanString behavior is a strong clue. Suppose I wrote an expression to match one or more items, but defined it incorrectly so that the second item in my list did not match. Then OneOrMore(expr) would stop after the first item, while expr.scanString would "succeed" in the sense that it would skip ahead and return more matches, yet it would still miss (or mis-parse) the item I actually wanted.

    import pyparsing as pp
    
    data = "AAA _AB BBB CCC"
    
    expr = pp.Word(pp.alphas)
    print(pp.OneOrMore(expr).parseString(data))
    

    Gives:

    ['AAA']
    

    At first glance, this looks like the OneOrMore is failing, whereas looping over expr.scanString(data) and printing the tokens of each match shows more matches:

    ['AAA']
    ['AB']  <- really wanted '_AB' here
    ['BBB']
    ['CCC']
    

    Here is a loop using scanString which prints not the matches, but the gap before each match, along with the position where that match starts:

    # loop to find non-matching parts in data
    last_end = 0
    for t,s,e in expr.scanString(data):
        gap = data[last_end:s]
        print(s, ':', repr(gap))
        last_end = e
    

    Giving:

    0 : ''
    5 : ' _'  <-- AHA!!
    8 : ' '
    12 : ' '
    

    Here's another way to visualize this.

    # print markers where each match begins in input string
    markers = [' ']*len(data)
    for t,s,e in expr.scanString(data):
        markers[s] = '^'
    
    print(data)
    print(''.join(markers))
    

    Prints:

    AAA _AB BBB CCC
    ^    ^  ^   ^  
    

    Your code would be a little more complex since your data spans many lines, but using pyparsing's line, lineno and col functions, you could do something similar.
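
As a rough sketch of that multi-line idea, the gap loop can be extended to report the line and column of each skipped span using pyparsing's module-level lineno, col and line functions. The sample data and the helper name find_gaps are made up for illustration and are not taken from the question's log format:

```python
import pyparsing as pp

# Made-up multi-line sample: the '_' on line 2 is not matched by Word(alphas).
data = "AAA BBB\nCCC _DD EEE\nFFF"

expr = pp.Word(pp.alphas)


def find_gaps(expr, text):
    """Return (lineno, col, skipped_text, source_line) for each non-blank gap.

    find_gaps is a hypothetical helper, not part of pyparsing.
    """
    gaps = []
    last_end = 0

    def record(gap_start, gap_end):
        gap = text[gap_start:gap_end]
        if gap.strip():
            # Point at the first non-whitespace character of the gap.
            loc = gap_start + (len(gap) - len(gap.lstrip()))
            gaps.append((pp.lineno(loc, text), pp.col(loc, text),
                         gap.strip(), pp.line(loc, text)))

    for tokens, start, end in expr.scanString(text):
        record(last_end, start)
        last_end = end
    # Text left over after the last match counts as a gap too.
    record(last_end, len(text))
    return gaps


for lineno, col, skipped, source_line in find_gaps(expr, data):
    print("line %d, col %d: skipped %r" % (lineno, col, skipped))
    print(source_line)
    print(" " * (col - 1) + "^")
```

Gaps that contain only whitespace are ignored here, since the default whitespace skipping accounts for them. If the real data contains tabs, keep in mind that scanString expands tabs before scanning unless parseWithTabs() has been called on the expression, so the reported offsets may refer to the tab-expanded text. Calling data.expandtabs() yourself before scanning is one way to keep the offsets and the printed lines aligned.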