
Pyparsing for Paragraphs


I have run into a slight problem with pyparsing that I can't seem to solve. I'd like to write a rule that will parse a multiline paragraph for me. The end goal is to end up with a recursive grammar that will parse something like:

Heading: awesome
    This is a paragraph and then
    a line break is inserted
    then we have more text

    but this is also a different line
    with more lines attached

    Other: cool
        This is another indented block
        possibly with more paragraphs

        This is another way to keep this up
        and write more things

    But then we can keep writing at the old level
    and get this

I'd like to turn this into something like HTML (of course, once I have a parse tree, I can transform it into whatever format I like):

<Heading class="awesome">

    <p> This is a paragraph and then a line break is inserted and then we have more text </p>

    <p> but this is also a different line with more lines attached</p>

    <Other class="cool">
        <p> This is another indented block possibly with more paragraphs</p>
        <p> This is another way to keep this up and write more things</p>
    </Other>

    <p> But then we can keep writing at the old level and get this</p>
</Heading>

Progress

I have managed to get to the stage where I can parse the heading row, and an indented block using pyparsing. But I can't:

  • Define a paragraph as multiple lines that should be joined
  • Allow a paragraph to be indented

An Example

Following from here, I can get the paragraphs to output to a single line, but there doesn't seem to be a way to turn this into a parse tree without removing the line break characters.

I believe a paragraph should be:

words = ## I've defined words to allow a set of characters I need
lines = OneOrMore(words)
paragraph = OneOrMore(lines) + lineEnd

But this doesn't seem to work for me. Any ideas would be awesome :)


Solution

  • I managed to solve this, so for anybody who stumbles upon it in the future: you can define the paragraph as shown here, although it is certainly not ideal and doesn't exactly match the grammar that I described. The relevant code is:

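    from pyparsing import CharsNotIn, OneOrMore, Suppress, lineEnd

    # A line is any run of non-newline characters, then a discarded line break
    line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)
    # Negative lookahead: succeeds only where a line does not start
    emptyline = ~line
    paragraph = OneOrMore(line) + emptyline
    paragraph.setParseAction(join_lines)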
    

    Where join_lines is defined as:

    def join_lines(tokens):
        stripped = [t.strip() for t in tokens]
        joined = " ".join(stripped)
        return joined
    

    That should point you in the right direction if this matches your needs :) I hope that helps!

    A Better Empty Line

    The definition of empty line given above is definitely not ideal, and it can be improved dramatically. The best way I've found is the following:

    from pyparsing import LineEnd, LineStart, Suppress, ZeroOrMore

    empty_line = Suppress(LineStart() + ZeroOrMore(" ") + LineEnd())
    empty_line.setWhitespaceChars("")
    

    This allows you to have empty lines that are filled with spaces, without breaking the match.
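
To see the pieces working together, here is a minimal sketch that parses several paragraphs separated by blank lines. It is not the exact grammar from the answer above: the names blank_line and document are hypothetical, and a Regex is swapped in for the LineStart-based empty line because its whitespace handling is easier to reason about. It assumes blank lines contain only spaces or tabs and that the input ends with a newline.

```python
from pyparsing import CharsNotIn, LineEnd, OneOrMore, Regex, Suppress, ZeroOrMore


def join_lines(tokens):
    """Collapse the matched lines of one paragraph into a single string."""
    return " ".join(t.strip() for t in tokens)


# One non-empty line of text followed by its (discarded) line break.
# CharsNotIn does not skip leading whitespace, so any indentation stays
# in the token and is stripped later by join_lines.
line = CharsNotIn("\n") + Suppress(LineEnd())

# Assumption: a blank line is optional spaces/tabs followed by a newline.
# leaveWhitespace() stops pyparsing from skipping the newline itself
# before the regex gets a chance to match it.
blank_line = Suppress(Regex(r"[ \t]*\n").leaveWhitespace())

# A paragraph is a run of consecutive lines, none of which is blank.
paragraph = OneOrMore(~blank_line + line)
paragraph.setParseAction(join_lines)

# Hypothetical top-level rule: paragraphs separated by blank lines.
document = OneOrMore(paragraph + ZeroOrMore(blank_line))

text = (
    "This is a paragraph and then\n"
    "    a line break is inserted\n"
    "   \n"
    "but this is also a different line\n"
    "with more lines attached\n"
)

result = document.parseString(text, parseAll=True).asList()
print(result)
```

Each paragraph comes back as one joined string, so result is a flat list with one entry per paragraph. The negative lookahead ~blank_line is what stops a line made only of spaces from being swallowed as paragraph text. Handling the nested, indented headings from the question would need more than this, for example tracking the indentation level of each block.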