python, pyparsing

Parsing a text file in Python using pyparsing


I am trying to parse the following text using pyparsing.

acp (SOLO1,
     "solo-100",
     "hi here is the gift"
     "Maximum amount of money, goes",
     430, 90)

jhk (SOLO2,
     "solo-101",
     "hi here goes the wind."
     "and, they go beyond",
     1000, 320)

I have tried the following code but it doesn't work.

from pyparsing import Word, Forward, nestedExpr, alphas, nums

flag = Word(alphas+nums+'_'+'-')
enclosed = Forward()
nestedBrackets = nestedExpr('(', ')', content=enclosed)
enclosed << (flag | nestedBrackets)

# str1 holds the text shown above
print(list(enclosed.searchString(str1)))

The commas (,) inside the quoted strings produce undesired results.


Solution

  • Well, I might have oversimplified slightly in my comments - here is a more complete answer.

    If you don't really have to deal with nested data items, then a single-level parenthesized data group in each section will look like this:

    from pyparsing import (Suppress, Word, alphas, alphanums, nums, OneOrMore,
                           quotedString, delimitedList, Group)

    LPAR, RPAR = map(Suppress, "()")
    ident = Word(alphas, alphanums + "-_")
    integer = Word(nums)

    # treat consecutive quoted strings as one combined string
    quoted_string = OneOrMore(quotedString)
    # add parse action to concatenate multiple adjacent quoted strings;
    # a lone quoted string is returned unchanged, quotes included
    quoted_string.setParseAction(
        lambda t: ('"' + ''.join(s.strip('"\'') for s in t) + '"')
                  if len(t) > 1 else t[0])
    data_item = ident | integer | quoted_string

    # section defined with no nesting
    section = ident + Group(LPAR + delimitedList(data_item) + RPAR)
    

    I wasn't sure whether you omitted the comma between the two consecutive quoted strings on purpose, so I chose to follow the same rule as Python's compiler, in which adjacent quoted strings are treated as one longer string; that is, "AB CD " "EF" is the same as "AB CD EF". This is done by defining quoted_string as OneOrMore(quotedString) and attaching a parse action that concatenates the contents of the two or more component quoted strings.
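
To see the Python rule being imitated here, the following small check uses only the standard library (no pyparsing involved). The variable names are just for illustration.

```python
import ast

# Python's parser joins adjacent string literals into a single string,
# which is the behavior the quoted_string parse action imitates.
combined = ast.literal_eval('"AB CD " "EF"')
single = ast.literal_eval('"AB CD EF"')

print(combined == single)  # True
```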

    Finally, we create a parser for the whole input, which is one or more sections:

    results = OneOrMore(Group(section)).parseString(source)
    results.pprint()
    

    and get from your posted input sample:

    [['acp',
      ['SOLO1',
       '"solo-100"',
       '"hi here is the giftMaximum amount of money, goes"',
       '430',
       '90']],
     ['jhk',
      ['SOLO2',
       '"solo-101"',
       '"hi here goes the wind.and, they go beyond"',
       '1000',
       '320']]]
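
For convenience, here are the pieces above assembled into one script that can be run as-is. This is a sketch under a few assumptions: the sample text is stored in a variable named source, the inline lambda is written out as a named helper (join_adjacent, a name introduced here for readability), and the camelCase pyparsing names from this answer are available. As far as I know, pyparsing 3.x still accepts them as compatibility aliases.

```python
from pyparsing import (Group, OneOrMore, Suppress, Word, alphanums, alphas,
                       delimitedList, nums, quotedString)

# Sample input copied from the question.
source = '''
acp (SOLO1,
     "solo-100",
     "hi here is the gift"
     "Maximum amount of money, goes",
     430, 90)

jhk (SOLO2,
     "solo-101",
     "hi here goes the wind."
     "and, they go beyond",
     1000, 320)
'''

LPAR, RPAR = map(Suppress, "()")
ident = Word(alphas, alphanums + "-_")
integer = Word(nums)


def join_adjacent(tokens):
    """Merge two or more adjacent quoted strings into one quoted string."""
    if len(tokens) == 1:
        return tokens[0]  # a lone quoted string is kept unchanged
    return '"' + "".join(s.strip('"\'') for s in tokens) + '"'


quoted_string = OneOrMore(quotedString).setParseAction(join_adjacent)
data_item = ident | integer | quoted_string

# section defined with no nesting
section = ident + Group(LPAR + delimitedList(data_item) + RPAR)

results = OneOrMore(Group(section)).parseString(source).asList()
for name, items in results:
    print(name, items)
```

Calling asList() turns the ParseResults into plain nested lists, which are easier to compare or pass on to other code.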
    

    If you do have nested parenthetical groups, then your section definition can be as simple as this:

    # section defined with nesting
    section = ident + nestedExpr()
    

    As you have already found, though, this will retain the separate commas as if they were significant tokens instead of just data separators.
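
If you need nesting and also want the commas dropped, one alternative is to skip nestedExpr and write the recursion yourself with Forward. This is only a sketch: the names group and item and the nested sample input are made up for illustration, and it leaves out the adjacent-string concatenation shown earlier.

```python
from pyparsing import (Forward, Group, Optional, Suppress, Word, alphanums,
                       alphas, delimitedList, nums, quotedString)

LPAR, RPAR = map(Suppress, "()")
ident = Word(alphas, alphanums + "-_")
integer = Word(nums)

# A parenthesized group may hold plain items or further groups.
group = Forward()
item = ident | integer | quotedString | group
# delimitedList consumes the commas, so they never appear in the results.
group <<= Group(LPAR + Optional(delimitedList(item)) + RPAR)

section = ident + group

# Hypothetical nested input, not taken from the question.
sample = 'acp (SOLO1, "a, b", (430, 90), (x, (y, z)))'
nested = section.parseString(sample).asList()
print(nested)
```

Each parenthesized level becomes its own sublist, and a comma inside a quoted string stays part of that string.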