How to detect an empty paragraph in python-docx


Given a document containing a paragraph

import docx
d = docx.Document()
p = d.add_paragraph()

I expected the following technique to work every time:

if len(p._element) == 0:
    # p is empty

OR

if len(p._p) == 0:
    # p is empty

(Side question, what's the difference there? It seems that p._p is p._element in every case I've seen in the wild.)

If I add a style to my paragraph, the check no longer works:

>>> p2 = d.add_paragraph(style="Normal")
>>> print(len(p2._element))
1

Explicitly setting text=None doesn't help either, not that I would expect it to.

So how do I check if a paragraph is empty of content (specifically text and images, although more generic is better)?

Update

I messed around a little and found that setting the style apparently adds a single pPr element:

>>> p2._element.getchildren()
[<CT_PPr '<w:pPr>' at 0x7fc9a2b64548>]

The element itself is empty:

>>> len(p2._element.getchildren()[0])
0

But more importantly, it is not a run.

So my test now looks like this:

def isempty(par):
    return sum(len(run) for run in par._element.xpath('w:r')) == 0

I don't know enough about the underlying system to have any idea if this is a reasonable solution or not, and what the caveats are.

More Update

Seems like I need to be able to handle a few different cases here:

def isempty(par):
    p = par._p
    runs = p.xpath('./w:r[./*[not(self::w:rPr)]]')
    # All three conditions belong inside the predicate so that xpath() returns a node list
    others = p.xpath('./*[not(self::w:pPr) and not(self::w:r) and '
                     'not(contains(local-name(), "bookmark"))]')
    return len(runs) + len(others) == 0

This skips all w:pPr elements and runs with nothing but w:rPr elements. Any other element, except bookmarks, whether in the paragraph directly or in a run, will make the result non-empty.
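
To sanity-check that rule without a .docx file at hand, here is a rough standard-library restatement that works on hand-written paragraph XML. The names `is_empty_xml` and `parse_p` are made up for this sketch and are not part of python-docx; it also only looks at direct children of the paragraph, same as the XPath version.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix)
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _local(tag):
    """Strip the '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def parse_p(inner_xml):
    """Hypothetical helper: wrap an XML fragment in a <w:p> element."""
    return ET.fromstring(f'<w:p xmlns:w="{W}">{inner_xml}</w:p>')


def is_empty_xml(p):
    """Return True if <w:p> holds only w:pPr, bookmarks, and rPr-only runs."""
    for child in p:
        name = _local(child.tag)
        if name == "pPr" or "bookmark" in name:
            continue  # paragraph properties and bookmarks do not count
        if name == "r" and all(_local(c.tag) == "rPr" for c in child):
            continue  # a run with nothing but run properties does not count
        return False  # anything else counts as content
    return True


if __name__ == "__main__":
    print(is_empty_xml(parse_p("<w:pPr/>")))
    print(is_empty_xml(parse_p("<w:r><w:t>hi</w:t></w:r>")))
```

Note that, like the XPath version, this treats a run holding an empty w:t as non-empty, because any child other than w:rPr counts.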


Solution

  • The <w:p> element can have any of a large number of children, as you can see from the XML Schema excerpt here: http://python-docx.readthedocs.io/en/latest/dev/analysis/schema/ct_p.html (see the CT_P and EG_PContent definitions).

    In particular, it often has a w:pPr child, which is where the style setting goes.

    So your test isn't very reliable against false positives (if being empty is considered positive).

    I'd be inclined to use paragraph.text == '', which parses through the runs.

    A run can be empty (of text), so the mere presence of a run is not proof enough. The actual text is held in a w:t (text) element, which can also be empty. The .text approach avoids all those low-level complications for you, and because it is part of the public API it is much less likely to change in a future release.
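
As a rough sketch of the .text approach, extended to the images mentioned in the question: the helper name `is_empty` is hypothetical (it is not part of python-docx). The picture check assumes that images show up as w:drawing or legacy w:pict elements somewhere under the paragraph. It goes through the private `_p` attribute, so unlike the `.text` part it could break in a future release.

```python
def is_empty(paragraph):
    """Return True if a python-docx paragraph has no text and no pictures."""
    # Public API: .text gathers the text of the paragraph's runs.
    if paragraph.text != "":
        return False
    # Assumption: pictures appear as w:drawing (or legacy w:pict) descendants.
    # _p is the underlying <w:p> element; its xpath() knows the "w:" prefix.
    pictures = paragraph._p.xpath(".//w:drawing | .//w:pict")
    return len(pictures) == 0
```

With the document from the question, `is_empty(d.add_paragraph(style="Normal"))` should come out True, since the w:pPr child contributes neither text nor a picture. If whitespace-only paragraphs should count as empty too, comparing `paragraph.text.strip()` instead is one option.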