Parsing several multi-line blocks with pyparsing


I'm a complete pyparsing newbie, and am trying to parse a large file with multi-line blocks describing archive files and their contents.

I'm currently at the stage where I'm able to parse a single item (no starting newline, this hardcoded test data approximates reading in a real file):

    import pyparsing as pp

    one_archive = \
    """archive (
        name "something wicked this way comes.zip"
        file ( name wicked.exe size 140084 date 2022/12/24 23:32:00 crc B2CF5E58 )
        file ( name readme.txt size 1704 date 2022/12/24 23:32:00 crc 37F73AEE )
    )
    """
    pp.ParserElement.set_default_whitespace_chars(' \t')

    EOL = pp.LineEnd().suppress()
    start_of_archive_block = pp.LineStart() + pp.Keyword('archive (') + EOL
    end_of_archive_block = pp.LineStart() + ')' + EOL

    archive_filename = pp.LineStart() \
        + pp.Keyword('name').suppress() \
        + pp.Literal('"').suppress() \
        + pp.SkipTo(pp.Literal('"')).set_results_name("archive_name") \
        + pp.Literal('"').suppress() \
        + EOL

    field_elem = pp.Keyword('name').suppress() + pp.SkipTo(pp.Literal(' size')).set_results_name("filename") \
        ^ pp.Keyword('size').suppress() + pp.SkipTo(pp.Literal(' date')).set_results_name("size") \
        ^ pp.Keyword('date').suppress() + pp.SkipTo(pp.Literal(' crc')).set_results_name("date") \
        ^ pp.Keyword('crc').suppress() + pp.SkipTo(pp.Literal(' )')).set_results_name("crc")
    fields = field_elem * 4

    filerow = pp.LineStart() \
        + pp.Literal('file (').suppress() \
        + fields \
        + pp.Literal(')').suppress() \
        + EOL

    archive = start_of_archive_block.suppress() \
        + archive_filename \
        + pp.OneOrMore(pp.Group(filerow)) \
        + end_of_archive_block.suppress()

    archive.parse_string(one_archive, parse_all=True)

The result is a ParseResults object with all the data I need from that single archive. (For some reason, the trailing newline in the input string causes no problems, despite me doing nothing to actively handle it.)

However, try as I might, I cannot get from this point to a point where I could parse the following, more realistic data. The new features I need to handle are:

  • a single file_metadata block that starts the file (I don't need it in my parsing results, it can be skipped entirely)
  • multiple archive items
  • newlines between the archive items

    realistic_data = \
    """
    file_metadata (
        description: blah blah etc.
        author: john doe
        version: 0.99
    )

    archive (
        name "something wicked this way comes.zip"
        file ( name wicked.exe size 140084 date 2022/12/24 23:32:00 crc B2CF5E58 )
        file ( name readme.txt size 1704 date 2022/12/24 23:32:00 crc 37F73AEE )
    )

    archive (
        name "naughty or nice.zip"
        file ( name naughty.exe size 187232 date 2021/8/4 10:19:55 crc 638BC6AA )
        file ( name nice.exe size 298234 date 2021/8/4 10:19:56 crc 99FD31AE )
        file ( name whatever.jpg size 25603 date 2021/8/5 11:03:09 crc ABFAC314 )
    )
    """

I've been semi-randomly trying a variety of things, but I have large fundamental gaps in my understanding of how pyparsing works, so they're not worth itemizing here. Someone who knows what they're doing can probably immediately see what to do here.

My ultimate goal is to parse all of these archive items and store them in a database.

What's the solution?


Solution

  • Two days later, I managed it. Something about pyparsing clicked in my brain in the interim and I figured out a much better, shorter and more "pyparsing native feeling" way of going about things.

    Given this data, I want to ignore the file_metadata block and parse each of the archive blocks that follow it, one by one:

    realistic_data = \
    """
    file_metadata (
        description: blah blah etc.
        author: john doe
        version: 0.99
    )
    
    archive (
        name "something wicked this way comes.zip"
        file ( name wicked.exe size 140084 date 2022/12/24 23:32:00 crc B2CF5E58 )
        file ( name readme.txt size 1704 date 2022/12/24 23:32:00 crc 37F73AEE )
    )
    
    archive (
        name "naughty or nice.zip"
        file ( name naughty.exe size 187232 date 2021/8/4 10:19:55 crc 638BC6AA )
        file ( name nice.exe size 298234 date 2021/8/4 10:19:56 crc 99FD31AE )
        file ( name whatever.jpg size 25603 date 2021/8/5 11:03:09 crc ABFAC314 )
    )
    """
    

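    The grammar below parses it correctly, with nice groupings and names. Because scanString only reports the parts of the input that match, the metadata header is skipped without needing a rule of its own. It also returns a generator, so the archive blocks are produced one at a time rather than collected into a list first, which helps with large inputs (the text itself still has to be passed in as a single string):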

    import pyparsing as pp
    
    LPAREN, RPAREN = map(pp.Suppress, map(pp.Literal, '()'))
    # archive and file
    BLOCKSTART = pp.Word(pp.alphas).suppress() + LPAREN
    BLOCKEND = RPAREN
    # archive name and all the field labels of a file
    LABEL = pp.Word(pp.alphas).suppress()
    
    # Name row
    name = LABEL + pp.QuotedString(quote_char='"')('name')
    
    # File row
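    value = pp.Word(pp.printables, excludeChars=' ')
    # The date and time are separated by a space, so join up to two values into one token
    # (the DelimitedList class is available from pyparsing 3.1 onward)
    datevalue = pp.DelimitedList(value, delim=' ', max=2, combine=True)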
    file = BLOCKSTART + pp.Group(LABEL + value('name') + LABEL + value('size') + LABEL + datevalue('date') + LABEL + value('crc')) + BLOCKEND
    files = pp.OneOrMore(file)('files')
    
    # Full archive block
    archive = BLOCKSTART + name + files + BLOCKEND
    
    # Returns a generator that yields a (tokens, start, end) tuple for each archive block, one by one
    archive.scanString(realistic_data)
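
    For reference, here is a minimal sketch of consuming that generator and reading the named results. It assumes pyparsing 3.x and uses the PEP 8 name scan_string (scanString is the older spelling of the same method). The names sample, file_row and archives are made up for this illustration, and the date is rebuilt with pp.Combine instead of DelimitedList purely so the sketch does not depend on the newer class; treat it as one possible way to do it, not the only one.

```python
import pyparsing as pp

# Shortened sample input (hypothetical, same shape as realistic_data)
sample = """
file_metadata (
    description: blah blah etc.
)

archive (
    name "something wicked this way comes.zip"
    file ( name wicked.exe size 140084 date 2022/12/24 23:32:00 crc B2CF5E58 )
    file ( name readme.txt size 1704 date 2022/12/24 23:32:00 crc 37F73AEE )
)

archive (
    name "naughty or nice.zip"
    file ( name naughty.exe size 187232 date 2021/8/4 10:19:55 crc 638BC6AA )
)
"""

LPAREN, RPAREN = map(pp.Suppress, "()")
BLOCKSTART = pp.Word(pp.alphas).suppress() + LPAREN
LABEL = pp.Word(pp.alphas).suppress()

name = LABEL + pp.QuotedString('"')("name")
value = pp.Word(pp.printables)
# Assumption: a date is always exactly two values separated by one space
datevalue = pp.Combine(value + " " + value)
file_row = BLOCKSTART + pp.Group(
    LABEL + value("name")
    + LABEL + value("size")
    + LABEL + datevalue("date")
    + LABEL + value("crc")
) + RPAREN
archive = BLOCKSTART + name + pp.OneOrMore(file_row)("files") + RPAREN

# Each item from scan_string is a (tokens, start, end) tuple;
# only the tokens are needed here.
archives = []
for tokens, start, end in archive.scan_string(sample):
    archives.append({
        "name": tokens["name"],
        "files": [f.as_dict() for f in tokens["files"]],
    })

for arc in archives:
    print(arc["name"], len(arc["files"]))
```

    Converting each group with as_dict() gives plain dictionaries, which are easier to hand to other code than ParseResults objects.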
    

    Hopefully this will be of use to someone in a similar situation!
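
    As for the database step mentioned in the question, the following is only a rough sketch of how parsed archives could be stored. It assumes each archive has already been turned into a plain dict with a "name" and a list of "files" (for example via ParseResults.as_dict()). SQLite, the two-table schema, and the names store_archives and parsed are all assumptions chosen for illustration, not something the original data format dictates.

```python
import sqlite3

def store_archives(conn, archives):
    """Insert archive dicts and their file rows into two linked tables."""
    # Hypothetical schema: one row per archive, one row per contained file
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file ("
        "archive_id INTEGER NOT NULL REFERENCES archive(id), "
        "name TEXT, size INTEGER, date TEXT, crc TEXT)"
    )
    for arc in archives:
        cur = conn.execute("INSERT INTO archive (name) VALUES (?)", (arc["name"],))
        archive_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO file (archive_id, name, size, date, crc) "
            "VALUES (?, ?, ?, ?, ?)",
            [
                # Parsed values arrive as strings; size is converted to int here
                (archive_id, f["name"], int(f["size"]), f["date"], f["crc"])
                for f in arc["files"]
            ],
        )
    conn.commit()

# Hand-written stand-in for the parser output
parsed = [
    {
        "name": "naughty or nice.zip",
        "files": [
            {"name": "naughty.exe", "size": "187232",
             "date": "2021/8/4 10:19:55", "crc": "638BC6AA"},
            {"name": "nice.exe", "size": "298234",
             "date": "2021/8/4 10:19:56", "crc": "99FD31AE"},
        ],
    },
]

conn = sqlite3.connect(":memory:")
store_archives(conn, parsed)
rows = conn.execute(
    "SELECT a.name, COUNT(*) FROM archive a "
    "JOIN file f ON f.archive_id = a.id GROUP BY a.id"
).fetchall()
print(rows)
```

    Since the parser yields one archive at a time, the same function could be called per archive (or per batch) instead of building the full list first.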