I am trying to reformat this .docx document using the python-docx module. Each question ends with the marker "-- ans end --". I want to insert a page break after that marker with the following code:
import docx, re
from pathlib import Path
from docx.enum.text import WD_BREAK
filename = Path("DOCUMENT_NAME")
doc = docx.Document(filename)
for para in doc.paragraphs:
    match = re.search(r"-- ans end --", para.text)
    if match:
        run = para.add_run()
        run.add_break(WD_BREAK.PAGE)
After each page break there seem to be 2 empty lines, which I tried to remove with:
para.text = para.text.strip("\n")
Stripping the empty lines before adding the page break does nothing, while stripping them after adding the page break removes the page break.
Please tell me how to eliminate, or avoid adding, the 2 empty lines. Thanks.
Update:
The page break should be added to the start of the next paragraph/section instead of after -- ans end -- (the end of this section), because a page break creates a new line when it is added to the end of a paragraph (try it in Word). Therefore I used this:
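from docx.oxml import OxmlElement
from docx.text.run import Run
run = para.runs[0]  # para here is the paragraph that follows -- ans end --
new_run_element = OxmlElement("w:r")  # assumed definition (missing from the original snippet): a new, empty run element
run._element.addprevious(new_run_element)  # place the new run before the first existing run
new_run = Run(new_run_element, run._parent)
new_run.text = ""
new_run.add_break(WD_BREAK.PAGE)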
to add a page break to the start of next paragraph instead, which does not create a new line.
Have you looked at the contents of your doc before and after altering it? E.g.:
for para in doc.paragraphs:
    print(repr(para.text))  # the call to repr() makes your `\n`s show up
This is helpful for figuring out what is going on.
Prior to altering your doc, there are no \n characters in the -- ans end -- paragraphs, so it makes sense that stripping the empty lines before adding your page break doesn't do anything. Also, prior to altering your doc, there is a paragraph containing an empty string right after each -- ans end --:
'-- ans --'
'-- ans end --'
''
is what things look like before you edit the doc. (Except there is one case where -- ans end -- is followed by two ''s, which is annoyingly different from all the others.)
After editing the doc, those sections look like this.
'-- ans end --\n'
''
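That trailing \n is how the page break shows up in para.text here, which would explain the behaviour in the question. As far as I understand it, assigning to para.text replaces the paragraph's existing runs with a single new run, so para.text = para.text.strip("\n") rebuilds the paragraph without the break. A small plain-Python sketch of the string side of this (the two strings are copied from the output above; no document is involved):

```python
# para.text for the marker paragraph, before and after the page break is added
before = "-- ans end --"
after = "-- ans end --\n"

# Before the break is added there is nothing to strip, so nothing changes.
print(repr(before.strip("\n")))

# After the break is added, the "\n" that stands in for the break is stripped,
# and writing the result back to para.text would drop the break with it.
print(repr(after.strip("\n")))
```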
When I run this code, as I mentioned in my comment above, the page break actually shows up in the wrong spot: right after -- ans end -- instead of right before. I think that can be worked around in a fairly straightforward way; I'll leave it to you if you're also having that issue.
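One possible way to sidestep the placement issue, as a sketch rather than a tested fix for this particular document: instead of inserting a break run, set the "page break before" paragraph property on the first non-empty paragraph that follows each marker. This is a different technique from add_break(); it uses python-docx's paragraph_format.page_break_before. The function name and the MARKER constant below are my own, and doc is assumed to be an already opened python-docx Document.

```python
MARKER = "-- ans end --"


def break_before_following_paragraphs(doc, marker=MARKER):
    """Flag the first non-empty paragraph after each marker to start a new page.

    `doc` is assumed to be a python-docx Document. Returns how many
    paragraphs were flagged.
    """
    paragraphs = list(doc.paragraphs)
    flagged = 0
    for index, para in enumerate(paragraphs):
        if marker not in para.text:
            continue
        # Skip the empty paragraphs that sit directly after the marker.
        for following in paragraphs[index + 1:]:
            if following.text.strip():
                following.paragraph_format.page_break_before = True
                flagged += 1
                break
    return flagged


# Usage (assumes python-docx is installed and the file exists):
#   import docx
#   doc = docx.Document("DOCUMENT_NAME")
#   break_before_following_paragraphs(doc)
#   doc.save("DOCUMENT_NAME")
```

Because no run is added to the marker paragraph, its text stays unchanged and no extra line appears at the end of it.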
If you remove those '' paragraphs I think that solves your problem. It is annoying to remove a paragraph from a document, but see this GitHub answer for an incantation which does it.
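For reference, a minimal sketch of how that removal could look. The delete_paragraph helper follows the commonly shared workaround of detaching the paragraph's XML element from its parent; it relies on private attributes (_element, _p), so treat it as an assumption that may break between python-docx versions. The other function names and the MARKER constant are mine, for illustration only.

```python
MARKER = "-- ans end --"


def delete_paragraph(paragraph):
    """Detach the paragraph's XML element from its parent (uses private attributes)."""
    element = paragraph._element
    element.getparent().remove(element)
    paragraph._p = paragraph._element = None


def blank_indexes_after_marker(texts, marker=MARKER):
    """Return the indexes of the empty strings that directly follow a marker string."""
    to_remove = []
    after_marker = False
    for index, text in enumerate(texts):
        if marker in text:
            after_marker = True
        elif after_marker and not text.strip():
            to_remove.append(index)  # handles one or several '' in a row
        else:
            after_marker = False
    return to_remove


def remove_blank_paragraphs_after_marker(doc, marker=MARKER):
    """Delete the empty paragraphs after each marker; `doc` is assumed to be a python-docx Document."""
    paragraphs = list(doc.paragraphs)
    indexes = blank_indexes_after_marker([p.text for p in paragraphs], marker)
    for index in indexes:
        delete_paragraph(paragraphs[index])
    return len(indexes)
```

It is probably safest to run this before adding any page breaks, so that a paragraph holding nothing but a break is not mistaken for an empty one and deleted.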