I am currently working on docx files and I am using the w:lastRenderedPageBreak as a marker for every page's content. It is necessary that I determine if a page has already ended.
My current code is like this:
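    from docx import Document

    document = Document(file)  # file: path to the .docx being parsed
    for p in document.paragraphs:
        # The marker appears in the paragraph's raw XML if Word recorded a page break here
        if 'lastRenderedPageBreak' in p._element.xml:
            pass  # do something
        # rest of code here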
Now the problem I encountered is that a docx file that has 4 pages only has 2 w:lastRenderedPageBreak tags. I tried opening the docx file and saving it but the w:lastRenderedPageBreak tags do not increase.
The only time that the w:lastRenderedPageBreak would properly show the page breaks is when I open the docx file and save it as an XML file.
Is there any way to skip the save-as-XML step and still see the correct w:lastRenderedPageBreak elements while parsing the text and formatting with python-docx? I would like to do it in Python, win32com, or VBA if possible.
Edit: The reason I want the w:lastRenderedPageBreak is that I had issues handling footnotes while parsing content, because they are formatted the same way as normal text (a problem with the source that can't be fixed). The only difference is that they have a superscript number at the beginning. This is why I need to determine whether a page has ended: if the script does not know that, it keeps appending text from the next page to the footnote until it finds a w:lastRenderedPageBreak.
Ex: I want the docx's XML to change from this:
Footnote 1: Text here. \p Additional text here that belongs to footnote 1. Footnote 2: Text here. new page text starts here...
into this:
Footnote 1: Text here. \p Additional text here that belongs to footnote 1. Footnote 2: Text here. <w:lastRenderedPageBreak> new page text starts here...
All text is contained in frames, so there is no need to worry about page size, orientation, or margins. It does not matter how the docx looks, as long as the end of a page or the beginning of a new page can be marked in the content or XML.
w:lastRenderedPageBreak has too many limitations to be useful as an indicator of pagination:
1. If a document has never been rendered, there will be no w:lastRenderedPageBreak elements.
2. If a document has been changed since being rendered, existing w:lastRenderedPageBreak elements will be stale.
3. Rendering can depend upon characteristics of the target media.
4. Rendering can depend upon line- and page-breaking algorithms or details of their implementations.
Even if one can live with limitations #1 through #4, w:lastRenderedPageBreak has historically had reliability issues.
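To see how far the stored markers fall short for a given file, one can count them directly and compare the result with the page count Word reports. The following is only a minimal sketch using the standard library: the function name count_rendered_page_breaks is hypothetical, and it assumes the main document part is stored at word/document.xml, which is the usual location but is not guaranteed.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the "w:" prefix maps to.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_rendered_page_breaks(docx_file):
    """Count w:lastRenderedPageBreak elements in the main document part.

    docx_file may be a path or a binary file-like object.
    Assumption: the main part lives at word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return sum(1 for _ in root.iter(f"{{{W_NS}}}lastRenderedPageBreak"))
```

A document with N pages would be expected to yield at most N - 1 markers in its body; a lower count, such as 2 for a 4-page file, suggests the markers were never written for some breaks or are stale, so they cannot be trusted as page boundaries.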
For further details, see: