
Tokenize nested expression but ignore quoted string with spaces


I am looking to pretty print the following string

r"file='//usr/env/0/test/0', name='test', msg=Test.Msg(type=String, bytes_=Bytes(value=b\" 0\x80\x00\x00y\x17\`\"))"

to

    file='//usr/env/0/test/0',
    name='test',
    msg=Test.Msg(
        type=String,
        bytes_=Bytes(
            value=b\" 0\x80\x00\x00y\x17\`\"
        )
    )

To start off, I tried using pyparsing to tokenize the input

from pyparsing import *
content = r"(file='//usr/env/0/test/0', name='test', msg=Test.Msg(type=String, bytes_=Bytes(value=b\" 0\x80\x00\x00y\x17\`\")))"
expr     = nestedExpr( '(', ')', ignoreExpr=None)
result = expr.parseString(content)
result.pprint()

This gives me a nested list, but the bytes literal gets split on whitespace:

[["file='//usr/env/0/test/0',",
  "name='test',",
  'msg=Test.Msg',
  ['type=String,',
   'bytes_=Bytes',
   ['value=b\\"', '0\\x80\\x00\\x00y\\x17\\`\\"']]]]

Anyone know how I can delimit on comma to return the following instead?

[["file='//usr/env/0/test/0',",
  "name='test',",
  'msg=Test.Msg',
  ['type=String,',
   'bytes_=Bytes',
   ['value=b\\" 0\\x80\\x00\\x00y\\x17\\`\\"']]]]

Solution

  • To get the desired results, we need to define a content expression for the contents of your nested expression. The default content is any quoted string or space-delimited word, but your content looks more like a comma-separated list.

    Pyparsing defines a comma_separated_list expression in pyparsing_common, but it won't work here because it doesn't understand that the ()s for the nested expression should not be part of the items in the comma-separated list. So we have to write a slightly modified version:

    from pyparsing import *
    content = r"""(file='//usr/env/0/test/0', name='test', msg=Test.Msg(type=String, bytes_=Bytes(value=b" 0\x80\x00\x00y\x17\`")))"""
    
    # slightly modified comma_separated_list from pyparsing_common
    commasepitem = (
            Combine(
                OneOrMore(
                    ~Literal(",")
                    + Word(printables, excludeChars="(),")
                    + Optional(White(" ") + ~FollowedBy(oneOf(", ( )")))
                )
            )
        )
    comma_separated_list = delimitedList(quotedString() | commasepitem)
    
    expr     = nestedExpr( '(', ')', content=comma_separated_list)
    
    result = expr.parseString(content)
    result.pprint(width=60)
    
    print(result.asList() == 
            [["file='//usr/env/0/test/0'",
              "name='test'",
              'msg=Test.Msg',
              ['type=String',
               'bytes_=Bytes',
               ['value=b" 0\\x80\\x00\\x00y\\x17\\`"']]]])
    

    prints:

    [["file='//usr/env/0/test/0'",
      "name='test'",
      'msg=Test.Msg',
      ['type=String',
       'bytes_=Bytes',
       ['value=b" 0\\x80\\x00\\x00y\\x17\\`"']]]]
    True
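
The question's end goal was pretty-printing, so here is a minimal sketch of one way to turn the nested list into the indented layout. It is plain Python and does not use pyparsing. The helper name `format_nested` is hypothetical, and it assumes every nested list directly follows the string naming the call it belongs to (as in `'msg=Test.Msg', [...]`).

```python
def format_nested(items, depth=0, indent="    "):
    """Return the pretty-printed lines for one level of the nested list."""
    pad = indent * depth
    lines = []
    i = 0
    while i < len(items):
        name = items[i]
        # Assumption: a list right after a string holds that call's arguments.
        has_args = i + 1 < len(items) and isinstance(items[i + 1], list)
        args = items[i + 1] if has_args else None
        i += 2 if has_args else 1
        # Every entry except the last one at this level gets a trailing comma.
        comma = "," if i < len(items) else ""
        if args is None:
            lines.append(f"{pad}{name}{comma}")
        else:
            lines.append(f"{pad}{name}(")
            lines.extend(format_nested(args, depth + 1, indent))
            lines.append(f"{pad}){comma}")
    return lines


# Same structure as result.asList() in the answer above.
nested = [["file='//usr/env/0/test/0'",
           "name='test'",
           'msg=Test.Msg',
           ['type=String',
            'bytes_=Bytes',
            ['value=b" 0\\x80\\x00\\x00y\\x17\\`"']]]]

# The outermost list only wraps the top-level group, so format its contents.
print("\n".join(format_nested(nested[0])))
```

With the parsed result from above, `format_nested(result.asList()[0])` should give the same lines. The commas that `delimitedList` suppresses are added back for every entry except the last one at each level.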