
How to use pyparsing for multiline fields that have two different types of ending


As shown below, each repeated block starts with a dashed line, followed by some key-value pairs and finally a description with an unknown number of lines. The whole input ends with EOF.

I have a problem with the description. If a block is not the last one, its description ends where the next dashed line starts, but for the last block it ends at EOF.

So I'm quite confused about how to build a grammar for "description". What approach would you suggest for this kind of format?

Thank you.

------
AAA: Value1
BBB: Value2

Description
Lorem ipsum dolor sit amet
consectetur adipiscing elit.
------
AAA: Value3
BBB: Value4
CCC: Value5
DDD: Value6

Description
In efficitur, turpis sit amet malesuada dignissim
Turpis nunc imperdiet ipsum, eu auctor leo arcu at libero
consectetur adipiscing elit.
------
AAA: Value7
BBB: Value
EEE: Value6

Description
In efficitur, turpis sit amet malesuada dignissim
Turpis nunc imperdiet ipsum, eu auctor leo arcu at libero

consectetur adipiscing elit
Lorem ipsum dolor sit amet.

Solution

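  • See how msg_terminator is used in this sample code. It is needed in two places: once as the stop condition for the repetition in the definition of msg, and once in the overall entry expression, so it is helpful to define it as an expression of its own. Note that the sample uses a simplified layout (a timestamp line, tags, a msg block, and a "---" separator between entries) rather than the exact layout in the question.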

    I've also added some features of pyparsing in this example beyond the basics:

    • using ParserElement.set_default_whitespace_chars for a parser that has significant newlines
    • use of [...] for ZeroOrMore, and [...:expr] for ZeroOrMore with stop_on=expr
    • expr("name") for expr.set_results_name("name")
    • Dict to auto-name contained groups of key-value expressions
    • using pp.common expressions to parse a timestamp and convert to a python datetime
    • using pp.Empty to advance past optional whitespace

    I hope these help you in other parts of your parser.

    # https://stackoverflow.com/questions/75782477/how-to-use-pyparsing-for-multilined-fields-that-has-two-different-types-of-endin
    
    sample = """\
    timestamp: 2001-01-01 12:34
    Color: red
    msg
    Now is the Winter of our discontent
    Made glorious Summer by this sun of York.
    ---
    timestamp: 2001-01-01 12:34
    Color: mauve
    Material: poly-cotton
    msg
    Tomorrow and tomorrow and tomorrow
    Creeps in this petty pace from day to day.
    """
    
    import pyparsing as pp
    
    pp.ParserElement.set_default_whitespace_chars(" ")
    NL = pp.LineEnd().suppress()
    COLON = pp.Suppress(":")
    
    # copy() so the parse action is not attached to the shared pp.common expression
    timestamp = pp.common.iso8601_datetime.copy().add_parse_action(
        pp.common.convert_to_datetime("%Y-%m-%d %H:%M")
    )
    
    tag = pp.Group(pp.Word(pp.alphas, pp.alphanums)("tag")
                   + COLON
                   + pp.Empty()
                   + pp.rest_of_line("value")
                   )
    
    # look for terminating "---" OR the end of the string
    msg_terminator = ('---' + NL | pp.StringEnd()).suppress()
    
    msg = pp.Group(
        pp.Suppress("msg" + NL)
        # the following line is equivalent to
        # pp.ZeroOrMore(pp.rest_of_line + NL, stop_on=msg_terminator)
        + (pp.rest_of_line + NL)[...:msg_terminator]
    )
    
    entry_expr = pp.Group(
        pp.Suppress('timestamp:') + timestamp("timestamp") + NL
        + pp.Dict((tag + NL)[...])("tags")
        + msg("msg")
        + msg_terminator
    )
    
    for entry in entry_expr[...].parse_string(sample):
        print(entry.dump())
    

    Prints:

    [datetime.datetime(2001, 1, 1, 12, 34), [['Color', 'red']], ['Now is the Winter of our discontent', 'Made glorious Summer by this sun of York.']]
    - msg: ['Now is the Winter of our discontent', 'Made glorious Summer by this sun of York.']
    - tags: [['Color', 'red']]
      - Color: 'red'
      [0]:
        ['Color', 'red']
        - tag: 'Color'
        - value: 'red'
    - timestamp: datetime.datetime(2001, 1, 1, 12, 34)
    [0]:
      2001-01-01 12:34:00
    [1]:
      [['Color', 'red']]
      - Color: 'red'
      [0]:
        ['Color', 'red']
        - tag: 'Color'
        - value: 'red'
    [2]:
      ['Now is the Winter of our discontent', 'Made glorious Summer by this sun of York.']
    [datetime.datetime(2001, 1, 1, 12, 34), [['Color', 'mauve'], ['Material', 'poly-cotton']], ['Tomorrow and tomorrow and tomorrow', 'Creeps in this petty pace from day to day.']]
    - msg: ['Tomorrow and tomorrow and tomorrow', 'Creeps in this petty pace from day to day.']
    - tags: [['Color', 'mauve'], ['Material', 'poly-cotton']]
      - Color: 'mauve'
      - Material: 'poly-cotton'
      [0]:
        ['Color', 'mauve']
        - tag: 'Color'
        - value: 'mauve'
      [1]:
        ['Material', 'poly-cotton']
        - tag: 'Material'
        - value: 'poly-cotton'
    - timestamp: datetime.datetime(2001, 1, 1, 12, 34)
    [0]:
      2001-01-01 12:34:00
    [1]:
      [['Color', 'mauve'], ['Material', 'poly-cotton']]
      - Color: 'mauve'
      - Material: 'poly-cotton'
      [0]:
        ['Color', 'mauve']
        - tag: 'Color'
        - value: 'mauve'
      [1]:
        ['Material', 'poly-cotton']
        - tag: 'Material'
        - value: 'poly-cotton'
    [2]:
      ['Tomorrow and tomorrow and tomorrow', 'Creeps in this petty pace from day to day.']
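The example above uses a simplified layout. As a hedged sketch, the same idea could be adapted to the format in the question, where the dashed line opens each block instead of closing it. In that layout the terminator only needs to act as a lookahead (the stop_on condition), because the next entry consumes the dashed line itself. The names dash_line, end_of_description, and parse_entries are illustrative and not from the original answer. The sketch also assumes that a dashed line is three or more "-" characters on a line of their own.

```python
import pyparsing as pp

# Newlines are significant, so only spaces are skipped as whitespace.
# Note that this setting is global to pyparsing.
pp.ParserElement.set_default_whitespace_chars(" ")

NL = pp.LineEnd().suppress()
COLON = pp.Suppress(":")

# assumption: a separator is 3 or more dashes alone on a line
dash_line = pp.Suppress(pp.Regex(r"-{3,}")) + NL

key_value = pp.Group(
    pp.Word(pp.alphas, pp.alphanums)
    + COLON
    + pp.Empty()  # skip the spaces after the colon
    + pp.rest_of_line
)

# the description stops before the next dashed line OR at the end of the input
end_of_description = dash_line | pp.StringEnd()
description = pp.ZeroOrMore(pp.rest_of_line + NL, stop_on=end_of_description)

entry = pp.Group(
    dash_line
    + pp.Group(pp.ZeroOrMore(key_value + NL))("fields")
    + pp.ZeroOrMore(NL)  # blank line(s) before the "Description" header
    + pp.Suppress("Description") + NL
    + pp.Group(description)("description")
)


def parse_entries(text):
    """Return a list of (fields_dict, description_lines) tuples."""
    results = pp.OneOrMore(entry).parse_string(text, parse_all=True)
    return [
        (dict(e["fields"].as_list()), e["description"].as_list())
        for e in results
    ]


sample = """\
------
AAA: Value1
BBB: Value2

Description
Lorem ipsum dolor sit amet
consectetur adipiscing elit.
------
AAA: Value7
EEE: Value6

Description
In efficitur, turpis sit amet malesuada dignissim

consectetur adipiscing elit
Lorem ipsum dolor sit amet."""

for fields, lines in parse_entries(sample):
    print(fields)
    print(lines)
```

Including StringEnd in the stop condition matters for the last block: rest_of_line and LineEnd can both match at the very end of the input, so without it the repetition would have nothing telling it to stop there. Blank lines inside a description are kept as empty strings in the list of lines.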