python python-3.x docx python-docx

Python-docx Extracted String Missing a Word


I can't figure out why the word "Delaware" does not get extracted from the code below. Every other character gets extracted. Can anyone provide code that extracts the word "Delaware" from the Docx file below, without altering the file manually?

Input:

import docx
import io
import requests

url = 'https://github.com/python-openxml/python-docx/files/1996979/Delaware_Test.docx'
file = io.BytesIO(requests.get(url).content)

for text in docx.Document(file).paragraphs:
    print(text.text)

Output:

APPLICABLE LAW This Agreement is to be construed and interpreted according to the laws of the State of , excluding its conflict of laws provisions. The provisions of the U. N. Convention on Contracts for the International Sale of Goods shall not apply to this Agreement.

The weirdest part is that if I do anything to the word "Delaware" in the document (e.g., bold/unbold it, or type over it) and then save, the word is no longer missing the next time I run the code. However, just saving the file without altering the word does not fix the problem. You might say the solution is to alter the word manually, but in reality I am dealing with thousands of these documents, and it doesn't make sense to alter every document one by one.

The answer at "Missing document text when using python-docx" appears to explain why "Delaware" might not be extracted, but it does not provide a solution. Thanks.


Solution

  • I believe @smci is right. This is most likely explained by "Missing document text when using python-docx". However, that answer does not provide a solution.

    I think our only alternative in this case is to fall back to reading the XML file directly. Consider, for instance, this (simplified) function from the webpage http://etienned.github.io/posts/extract-text-from-word-docx-simply/:

    from xml.etree.ElementTree import XML
    import zipfile
    import io
    import requests

    def get_docx_text(path):
        """Take the path of a docx file (or a file-like object), return its text."""

        WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
        PARA = WORD_NAMESPACE + 'p'
        TEXT = WORD_NAMESPACE + 't'

        # A .docx file is a zip archive; the body text lives in word/document.xml.
        with zipfile.ZipFile(path) as document:
            xml_content = document.read('word/document.xml')
        tree = XML(xml_content)

        paragraphs = []
        # iter() walks all descendants, so text in nested elements is included.
        # (getiterator() and cElementTree were removed in Python 3.9.)
        for paragraph in tree.iter(PARA):
            texts = [n.text for n in paragraph.iter(TEXT) if n.text]
            if texts:
                paragraphs.append(''.join(texts))

        return '\n\n'.join(paragraphs)

    url = 'https://github.com/python-openxml/python-docx/files/1996979/Delaware_Test.docx'
    file = io.BytesIO(requests.get(url).content)
    print(get_docx_text(file))
    

    And we get:

    APPLICABLE LAW
    
    This Agreement is to be construed and interpreted according to the laws of the State of Delaware, excluding its conflict of laws provisions.  The provisions of the U. N. Convention on Contracts for the International Sale of Goods shall not apply to this Agreement.
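
To illustrate why walking the XML helps, here is a minimal sketch using only the standard library. It assumes the cause is what the linked answer suggests: the missing word sits in a run that is nested inside a wrapper element rather than being a direct child of the paragraph. The `w:smartTag` wrapper, the tiny `DOCUMENT_XML` string, and the helper names `make_docx` and `paragraph_texts` are all made up for illustration; I have not confirmed that the real file uses this exact element.

```python
import io
import zipfile
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

# Hypothetical document.xml: the middle run is wrapped in a w:smartTag
# element (an assumption) instead of being a direct child of w:p.
DOCUMENT_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main">'
    '<w:body><w:p>'
    '<w:r><w:t>the State of </w:t></w:r>'
    '<w:smartTag w:uri="urn:example" w:element="State">'
    '<w:r><w:t>Delaware</w:t></w:r>'
    '</w:smartTag>'
    '<w:r><w:t>, excluding</w:t></w:r>'
    '</w:p></w:body></w:document>'
)


def make_docx(document_xml):
    """Build an in-memory zip holding only word/document.xml (enough for this demo)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', document_xml)
    buffer.seek(0)
    return buffer


def paragraph_texts(docx_file, nested=True):
    """Return one string per paragraph.

    nested=False reads only runs that are direct children of the paragraph;
    nested=True reads every w:t descendant, however deeply it is wrapped.
    """
    with zipfile.ZipFile(docx_file) as archive:
        tree = XML(archive.read('word/document.xml'))
    results = []
    for paragraph in tree.iter(W + 'p'):
        if nested:
            nodes = paragraph.iter(W + 't')
        else:
            nodes = paragraph.findall(W + 'r/' + W + 't')
        results.append(''.join(node.text or '' for node in nodes))
    return results


print(paragraph_texts(make_docx(DOCUMENT_XML), nested=False))
print(paragraph_texts(make_docx(DOCUMENT_XML), nested=True))
```

With `nested=False` the wrapped word drops out, which reproduces the symptom in the question; with `nested=True` it comes back. Since nothing here edits the files, the same function could be looped over thousands of documents.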