Tags: python, ms-word, python-docx

Cannot get the text in a ContentControl in Word using python-docx



I am new to Python and to coding, and I have a problem I need help with. I tried to read a docx document using python-docx, but all of the text I want is inside ContentControls. When I try to print the text of a paragraph that contains a ContentControl, an error occurs.

For example, I try to print the first paragraph using:

import docx

# Raw string, so the backslash in the Windows path is kept literally
doc = docx.Document(r"C:\ContentControl.docx")
p = doc.paragraphs
print(p[0].text)

Then I get an error like:

UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 8: illegal multibyte sequence

So what should I do to get the text in the ContentControl? Thanks a lot for your help!


Solution

  • You cannot, with Python-docx.

    If you check https://github.com/python-openxml/python-docx/blob/master/docx/oxml/text/paragraph.py (the code that reads paragraphs and their contents), you can see that it only parses two sub-elements of <w:p>: its formatting from <w:pPr>, and its text runs from <w:r>. The contents of a text run are parsed by text/run.py, which iterates over the run's child elements and stores data for rPr (local text run formatting), t (the plain text itself), tab (a literal tab), and a handful more.

    But Word's "contentControl" is stored in another tag, which is not parsed!

    <w:p>  <!-- paragraph -->
      <w:r>  <!-- text runs -->
        <w:t>Editions&#160;:</w:t>  <!-- plain text -->
      </w:r>  <!-- end text run -->
      <w:sdt>
        <w:sdtPr/>  <!-- properties elided -->
        <w:sdtContent>   <!-- something else! -->
          <w:r>
            <w:t>Henry</w:t>
          </w:r>
        </w:sdtContent>
      </w:sdt>
      <w:r>  <!-- next text run; just a tab -->
        <w:tab/>
        <w:t xml:space="preserve"> </w:t>
      </w:r>  <!-- end of that text run -->
    </w:p>

    (from your sample document; some markup is elided for brevity)

    As you can see, the ContentControl data is inside a <w:sdt> tag, which in turn is a direct child of <w:p>. So the code to read its data should be in paragraph.py, but it is not.

    You can clone python-docx and add proper handling of <w:sdt> yourself (and here is all the information you need for that), but it may well be easier to use Word itself and convert the content controls to plain text with a VBA macro.
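
If patching python-docx or writing a VBA macro feels too heavy, a third route is to read the XML directly. The following is only a minimal sketch using the standard library, not part of python-docx. The helper names paragraph_text and docx_paragraphs are made up for illustration, and it assumes the main document part lives at word/document.xml, which is typical for files saved by Word but is not guaranteed.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the w: prefix
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_text(p):
    """Join every <w:t> below a <w:p>, including those nested inside <w:sdt>."""
    parts = []
    for node in p.iter():  # walks all descendants, not just direct children
        if node.tag == W + "t":
            parts.append(node.text or "")
        elif node.tag == W + "tab":
            parts.append("\t")
    return "".join(parts)


def docx_paragraphs(path):
    """Return the text of each <w:p> in the main document part.

    Assumption: the main part is stored as word/document.xml.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [paragraph_text(p) for p in root.iter(W + "p")]


# The paragraph from the sample above, with the namespace declared
SAMPLE = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:r><w:t>Editions&#160;:</w:t></w:r>"
    "<w:sdt><w:sdtPr/><w:sdtContent><w:r><w:t>Henry</w:t></w:r></w:sdtContent></w:sdt>"
    '<w:r><w:tab/><w:t xml:space="preserve"> </w:t></w:r>'
    "</w:p>"
)

# repr() keeps the output ASCII-only, whatever the console encoding is
print(repr(paragraph_text(ET.fromstring(SAMPLE))))
```

Because it simply collects every descendant <w:t>, this sketch would count text twice if paragraphs are nested inside each other (for example in text boxes), and it ignores everything python-docx otherwise gives you, such as styles and runs.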


    By the way, your error has nothing to do with this. The "offending" character is the non-breaking space in the "Editions" line, stored as &#160; in the XML. Reading it is not the problem: a UnicodeEncodeError is raised when text is encoded for output, so the most likely cause is that your output (for example the console) uses the gbk codec instead of UTF-8, and gbk has no encoding for that character. There are some Chinese characters in the document as well, but they are also written as decimal character references, so the XML itself contains no non-ASCII characters.
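
The encoding problem can be reproduced without Word at all. The snippet below is a small sketch: can_encode and console_safe are hypothetical helper names, and replacing the non-breaking space with an ordinary space is just one possible workaround, assuming you do not need to keep that character in the output.

```python
import sys

NBSP = "\xa0"  # the non-breaking space stored as &#160; in the document


def can_encode(text, encoding):
    """Report whether the given codec can represent every character in text."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def console_safe(text, encoding=None):
    """Swap NBSP for a plain space and replace anything else the codec lacks."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    cleaned = text.replace(NBSP, " ")
    return cleaned.encode(encoding, errors="replace").decode(encoding)


sample = "Editions" + NBSP + ":"
print(can_encode(sample, "gbk"))    # False: the failure from the question
print(can_encode(sample, "utf-8"))  # True
print(console_safe(sample, "gbk"))  # Editions :
```

Alternatively, the output encoding itself can be switched to UTF-8, for instance by setting the PYTHONIOENCODING environment variable to utf-8 or, on Python 3.7 and later, by calling sys.stdout.reconfigure(encoding="utf-8"); whether the console then displays the result correctly depends on the terminal.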