I have run into a slight problem with pyparsing that I can't seem to solve. I'd like to write a rule that will parse a multiline paragraph for me. The end goal is a recursive grammar that will parse something like:
Heading: awesome
    This is a paragraph and then
    a line break is inserted
    then we have more text

    but this is also a different line
    with more lines attached

    Other: cool
        This is another indented block
        possibly with more paragraphs

        This is another way to keep this up
        and write more things

    But then we can keep writing at the old level
    and get this
and turn it into something like HTML (of course, once I have a parse tree I can transform it into whatever format I like):
<Heading class="awesome">
    <p> This is a paragraph and then a line break is inserted then we have more text </p>
    <p> but this is also a different line with more lines attached</p>
    <Other class="cool">
        <p> This is another indented block possibly with more paragraphs</p>
        <p> This is another way to keep this up and write more things</p>
    </Other>
    <p> But then we can keep writing at the old level and get this</p>
</Heading>
I have managed to get to the stage where I can parse the heading row and an indented block using pyparsing, but I am stuck on the paragraphs. I can get the paragraphs to output to a single line, but there doesn't seem to be a way to turn this into a parse tree without removing the line break characters.
I believe a paragraph should be:
words = ## I've defined words to allow a set of characters I need
lines = OneOrMore(words)
paragraph = OneOrMore(lines) + lineEnd
But this doesn't seem to work for me. Any ideas would be awesome :)
I managed to solve this, so here is the answer for anybody who stumbles upon this in the future. You can define the paragraph as shown here, although it is certainly not ideal and doesn't exactly match the grammar that I described. The relevant code is:
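from pyparsing import CharsNotIn, OneOrMore, Suppress, lineEnd

# a line is everything up to the newline; the newline itself is dropped
line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)
# negative lookahead: succeeds only where no further line can be matched
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)  # join_lines is defined below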
where join_lines is defined as:
def join_lines(tokens):
    stripped = [t.strip() for t in tokens]
    joined = " ".join(stripped)
    return joined
That should point you in the right direction if this matches your needs :) I hope that helps!
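To see what the parse action does on its own, it can be called directly. This is only a sketch: a plain list stands in for the tokens that pyparsing would pass to the action, which is an assumption made to keep the example independent of the grammar.

```python
def join_lines(tokens):
    # strip each matched line, then join them with single spaces
    stripped = [t.strip() for t in tokens]
    joined = " ".join(stripped)
    return joined

# hypothetical tokens, one string per matched line of a paragraph
tokens = ["  This is a paragraph and then", "a line break is inserted  "]
result = join_lines(tokens)
print(result)
```

The leading and trailing spaces disappear because each line is stripped before joining, so indentation inside a block does not leak into the paragraph text.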
The definition of empty line given above is definitely not ideal, and it can be improved dramatically. The best way I've found is the following:
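from pyparsing import LineEnd, LineStart, Suppress, ZeroOrMore

empty_line = Suppress(LineStart() + ZeroOrMore(" ") + LineEnd())
empty_line.setWhitespaceChars("")  # no whitespace skipping before this expression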
This allows you to have empty lines that are filled with spaces, without breaking the match.
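When testing a grammar like this, it may help to have a reference for the expected result. The sketch below is plain Python, not pyparsing, and the name split_paragraphs is made up for illustration. It treats any line that is empty or contains only whitespace as a paragraph break and joins the remaining lines the same way join_lines does.

```python
def split_paragraphs(text):
    """Group lines into paragraphs; whitespace-only lines act as breaks."""
    paragraphs, current = [], []
    for raw in text.splitlines():
        if raw.strip() == "":
            # an empty or spaces-only line closes the current paragraph
            if current:
                paragraphs.append(" ".join(current))
                current = []
        else:
            current.append(raw.strip())
    if current:
        # flush the last paragraph when the text does not end with a blank line
        paragraphs.append(" ".join(current))
    return paragraphs

# the second line break below is a line holding only spaces
sample = (
    "This is a paragraph and then\n"
    "a line break is inserted\n"
    "   \n"
    "but this is also a different line\n"
    "with more lines attached\n"
)
print(split_paragraphs(sample))
```

Comparing the grammar's output against a helper like this on the same input is one way to confirm that space-filled lines are being treated as empty.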