I'm a complete pyparsing newbie, and am trying to parse a large file with multi-line blocks describing archive files and their contents.
I'm currently at the stage where I'm able to parse a single item (no starting newline; this hardcoded test data approximates reading in a real file):
import pyparsing as pp
one_archive = \
"""archive (
name "something wicked this way comes.zip"
file ( name wicked.exe size 140084 date 2022/12/24 23:32:00 crc B2CF5E58 )
file ( name readme.txt size 1704 date 2022/12/24 23:32:00 crc 37F73AEE )
)
"""
pp.ParserElement.set_default_whitespace_chars(' \t')
EOL = pp.LineEnd().suppress()
start_of_archive_block = pp.LineStart() + pp.Keyword('archive (') + EOL
end_of_archive_block = pp.LineStart() + ')' + EOL
archive_filename = pp.LineStart() \
+ pp.Keyword('name').suppress() \
+ pp.Literal('"').suppress() \
+ pp.SkipTo(pp.Literal('"')).set_results_name("archive_name") \
+ pp.Literal('"').suppress() \
+ EOL
field_elem = pp.Keyword('name').suppress() + pp.SkipTo(pp.Literal(' size')).set_results_name("filename") \
^ pp.Keyword('size').suppress() + pp.SkipTo(pp.Literal(' date')).set_results_name("size") \
^ pp.Keyword('date').suppress() + pp.SkipTo(pp.Literal(' crc')).set_results_name("date") \
^ pp.Keyword('crc').suppress() + pp.SkipTo(pp.Literal(' )')).set_results_name("crc")
fields = field_elem * 4
filerow = pp.LineStart() \
+ pp.Literal('file (').suppress() \
+ fields \
+ pp.Literal(')').suppress() \
+ EOL
archive = start_of_archive_block.suppress() \
+ archive_filename \
+ pp.OneOrMore(pp.Group(filerow)) \
+ end_of_archive_block.suppress()
archive.parse_string(one_archive, parse_all=True)
The result is a ParseResults object with all the data I need from that single archive. (For some reason, the trailing newline in the input string causes no problems, despite me doing nothing to actively handle it.)
However, try as I might, I cannot get from this point to a point where I could parse the following, more realistic data. The new features I need to handle are:
- The file_metadata block that starts the file (I don't need it in my parsing results; it can be skipped entirely)
- Multiple archive items
realistic_data = \
"""
file_metadata (
description: blah blah etc.
author: john doe
version: 0.99
)
archive (
name "something wicked this way comes.zip"
file ( name wicked.exe size 140084 date 2022/12/24 23:32:00 crc B2CF5E58 )
file ( name readme.txt size 1704 date 2022/12/24 23:32:00 crc 37F73AEE )
)
archive (
name "naughty or nice.zip"
file ( name naughty.exe size 187232 date 2021/8/4 10:19:55 crc 638BC6AA )
file ( name nice.exe size 298234 date 2021/8/4 10:19:56 crc 99FD31AE )
file ( name whatever.jpg size 25603 date 2021/8/5 11:03:09 crc ABFAC314 )
)
"""
I've been semi-randomly trying a variety of things, but I have large fundamental gaps in my understanding of how pyparsing works, so they're not worth itemizing here. Someone who knows what they're doing can probably immediately see what to do here.
My ultimate goal is to parse all of these archive items and store them in a database.
What's the solution?
Two days later, I managed it. Something about pyparsing clicked in my brain in the interim and I figured out a much better, shorter and more "pyparsing native feeling" way of going about things.
Given this data, I want to ignore the file_metadata block and parse all the later archive blocks one by one:
realistic_data = \
"""
file_metadata (
description: blah blah etc.
author: john doe
version: 0.99
)
archive (
name "something wicked this way comes.zip"
file ( name wicked.exe size 140084 date 2022/12/24 23:32:00 crc B2CF5E58 )
file ( name readme.txt size 1704 date 2022/12/24 23:32:00 crc 37F73AEE )
)
archive (
name "naughty or nice.zip"
file ( name naughty.exe size 187232 date 2021/8/4 10:19:55 crc 638BC6AA )
file ( name nice.exe size 298234 date 2021/8/4 10:19:56 crc 99FD31AE )
file ( name whatever.jpg size 25603 date 2021/8/5 11:03:09 crc ABFAC314 )
)
"""
This parses it correctly, with nice groupings and names. Thanks to scanString, which returns a generator, it also ignores the metadata header and works with huge files:
import pyparsing as pp
LPAREN, RPAREN = map(pp.Suppress, map(pp.Literal, '()'))
# archive and file
BLOCKSTART = pp.Word(pp.alphas).suppress() + LPAREN
BLOCKEND = RPAREN
# archive name and all the field labels of a file
LABEL = pp.Word(pp.alphas).suppress()
# Name row
name = LABEL + pp.QuotedString(quote_char='"')('name')
# File row
value = pp.Word(pp.printables, excludeChars=' ')
datevalue = pp.DelimitedList(value, delim=' ', max=2, combine=True)
file = BLOCKSTART + pp.Group(LABEL + value('name') + LABEL + value('size') + LABEL + datevalue('date') + LABEL + value('crc')) + BLOCKEND
files = pp.OneOrMore(file)('files')
# Full archive block
archive = BLOCKSTART + name + files + BLOCKEND
# scanString returns a generator that yields (tokens, start, end) for each archive block found
for tokens, start, end in archive.scanString(realistic_data):
    print(tokens.dump())
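To consume the generator and get the results into a database (the goal stated in the question), something like the following might work. This is only a sketch: the sqlite3 table layout, the shortened sample data, and the Regex used for the date field are assumptions of this example, not part of the grammar above.

```python
import sqlite3
import pyparsing as pp

# Shortened sample in the same format as realistic_data
sample = """
file_metadata (
description: blah blah etc.
)
archive (
name "something wicked this way comes.zip"
file ( name wicked.exe size 140084 date 2022/12/24 23:32:00 crc B2CF5E58 )
file ( name readme.txt size 1704 date 2022/12/24 23:32:00 crc 37F73AEE )
)
archive (
name "naughty or nice.zip"
file ( name naughty.exe size 187232 date 2021/8/4 10:19:55 crc 638BC6AA )
)
"""

# Condensed restatement of the grammar
LPAREN, RPAREN = map(pp.Suppress, '()')
LABEL = pp.Word(pp.alphas).suppress()
BLOCKSTART = LABEL + LPAREN
value = pp.Word(pp.printables, excludeChars=' ')
# Assumption: a Regex stands in for the date field to keep the sketch short
datevalue = pp.Regex(r'\d{4}/\d{1,2}/\d{1,2} \d{1,2}:\d{2}:\d{2}')
file_row = BLOCKSTART + pp.Group(
    LABEL + value('name') + LABEL + value('size')
    + LABEL + datevalue('date') + LABEL + value('crc')
) + RPAREN
archive = (BLOCKSTART + LABEL + pp.QuotedString('"')('name')
           + pp.OneOrMore(file_row)('files') + RPAREN)

# Hypothetical schema: one row per file, tagged with its archive name
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE files (archive TEXT, name TEXT, size INTEGER, date TEXT, crc TEXT)'
)

# Each iteration yields one matched archive block; non-matching text is skipped
for tokens, _start, _end in archive.scanString(sample):
    rows = [
        (tokens['name'], f['name'], int(f['size']), f['date'], f['crc'])
        for f in tokens['files']
    ]
    conn.executemany('INSERT INTO files VALUES (?, ?, ?, ?, ?)', rows)
conn.commit()

stored_rows = conn.execute(
    'SELECT archive, name, size FROM files ORDER BY rowid'
).fetchall()
```

Inserting one archive at a time keeps only a single block's results in memory alongside the input string; for a real file, the in-memory database would be replaced with a path on disk.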
Hopefully this will be of use to someone in a similar situation!