Given a document containing a paragraph
import docx
d = docx.Document()
p = d.add_paragraph()
I expected the following technique to work every time:
if len(p._element) == 0:
    pass  # p is empty
OR
if len(p._p) == 0:
    pass  # p is empty
(Side question, what's the difference there? It seems that p._p is p._element
in every case I've seen in the wild.)
If I add a style to my paragraph, the check no longer works:
>>> p2 = d.add_paragraph(style="Normal")
>>> print(len(p2._element))
1
Explicitly setting text=None doesn't help either, not that I would expect it to.
So how do I check if a paragraph is empty of content (specifically text and images, although more generic is better)?
Update
I messed around a little and found that setting the style apparently adds a single pPr element:
>>> p2._element.getchildren()
[<CT_PPr '<w:pPr>' at 0x7fc9a2b64548>]
The element itself is empty:
>>> len(p2._element.getchildren()[0])
0
But more importantly, it is not a run.
So my test now looks like this:
def isempty(par):
    return sum(len(run) for run in par._element.xpath('w:r')) == 0
I don't know enough about the underlying system to have any idea if this is a reasonable solution or not, and what the caveats are.
More Update
Seems like I need to be able to handle a few different cases here:
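def isempty(par):
    p = par._p
    # Runs that contain anything other than run properties (w:rPr)
    runs = p.xpath('./w:r[./*[not(self::w:rPr)]]')
    # Any other direct child except paragraph properties and bookmarks
    others = p.xpath('./*[not(self::w:pPr) and not(self::w:r) and '
                     'not(contains(local-name(), "bookmark"))]')
    return len(runs) + len(others) == 0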
This skips all w:pPr elements and runs with nothing but w:rPr elements. Any other element, except bookmarks, whether in the paragraph directly or in a run, will make the result non-empty.
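To check those rules without a .docx file at hand, here is a minimal sketch that applies the same logic to hand-written XML using only the standard library. It is an illustration, not python-docx code: `is_empty_xml`, `local_name` and `parse_p` are hypothetical names, and it matches on local tag names only rather than on the full `w:` namespace.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def local_name(element):
    """Return the tag without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def is_empty_xml(p):
    """Apply the rules described above to a w:p element (sketch only)."""
    for child in p:
        name = local_name(child)
        if name == "pPr" or "bookmark" in name:
            continue  # paragraph properties and bookmarks are not content
        if name == "r":
            # A run counts only if it holds something besides w:rPr.
            if any(local_name(item) != "rPr" for item in child):
                return False
            continue
        return False  # any other direct child counts as content
    return True


def parse_p(inner):
    """Hypothetical helper: wrap an XML fragment in a w:p element."""
    return ET.fromstring(f'<w:p xmlns:w="{W}">{inner}</w:p>')


print(is_empty_xml(parse_p('<w:pPr/><w:r><w:rPr/></w:r>')))             # True
print(is_empty_xml(parse_p('<w:bookmarkStart w:id="0" w:name="a"/>')))  # True
print(is_empty_xml(parse_p('<w:r><w:t>hi</w:t></w:r>')))                # False
```

Note that under these rules a run holding an empty `<w:t/>` still counts as content, which matches the XPath version.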
The <w:p> element can have any of a large number of children, as you can see from the XML Schema excerpt here: http://python-docx.readthedocs.io/en/latest/dev/analysis/schema/ct_p.html (see the CT_P and EG_PContent definitions).
In particular, it often has a w:pPr child, which is where the style setting goes.
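To see why a child count is misleading here, a small stand-alone sketch may help. It parses hand-written XML with the standard library instead of going through python-docx. The XML string and the "Heading1" style value are assumptions made up for the illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written XML for a paragraph that has a style but no content.
STYLED_XML = (
    f'<w:p xmlns:w="{W}">'
    '<w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    '</w:p>'
)

styled = ET.fromstring(STYLED_XML)

# Tags of the direct children, and the runs among them.
child_tags = [child.tag for child in styled]
runs = styled.findall(f"{{{W}}}r")

print(len(styled))  # 1, because the w:pPr child is counted
print(len(runs))    # 0, because there are no runs at all
```

So the length of the paragraph element says how many children it has, not whether any of them carries content.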
So your test isn't very reliable against false positives (if being empty is considered positive).
I'd be inclined to use paragraph.text == '', which parses through the runs.
A run can be empty (of text), so the mere presence of a run is not proof enough. The actual text is held in a w:t (text) element, which can also be empty. So the .text approach avoids all those low-level complications for you and has the benefit of being part of the API, so it is much, much less likely to change in a future release.
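The empty-run and empty-w:t cases can be illustrated with a simplified stand-in for a text check, written against raw XML with the standard library. This is a sketch and not python-docx's implementation: `paragraph_text` and `parse_p` are hypothetical names, and the real `.text` property handles more than this (tabs and line breaks, for example).

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_text(p):
    """Join the text of every w:t found in the paragraph's direct runs."""
    return "".join(t.text or "" for t in p.findall("w:r/w:t", NS))


def parse_p(inner):
    """Hypothetical helper: wrap an XML fragment in a w:p element."""
    return ET.fromstring(f'<w:p xmlns:w="{W}">{inner}</w:p>')


print(paragraph_text(parse_p('<w:r/>')) == "")                    # True: empty run
print(paragraph_text(parse_p('<w:r><w:t/></w:r>')) == "")         # True: empty w:t
print(paragraph_text(parse_p('<w:r><w:t>hi</w:t></w:r>')) == "")  # False
```

One caveat, since the question also mentions images: a text-only check like this sketch reports a paragraph that holds nothing but a `w:drawing` as empty, so a picture would need a separate test.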