I can't figure out why the word "Delaware" is not extracted by the code below, even though all the other text is. Can anyone provide code that extracts the word "Delaware" from the docx file below, without altering the file manually?
Input:
import docx
import io
import requests
url = 'https://github.com/python-openxml/python-docx/files/1996979/Delaware_Test.docx'
file = io.BytesIO(requests.get(url).content)
for text in docx.Document(file).paragraphs:
    print(text.text)
Output:
APPLICABLE LAW This Agreement is to be construed and interpreted according to the laws of the State of , excluding its conflict of laws provisions. The provisions of the U. N. Convention on Contracts for the International Sale of Goods shall not apply to this Agreement.
The weirdest part is that if I do anything to the word "Delaware" in the document (e.g., bold/unbold it or type over it) and then save it, the word is no longer missing the next time I run the code. However, just saving the file without altering the word does not fix the problem. You might say the solution is to alter the word manually, but in reality I am dealing with thousands of these documents, and it doesn't make sense to alter every document one by one.
The answer at "Missing document text when using python-docx" appears to explain why "Delaware" might not be extracted, but it does not provide a solution. Thanks.
I believe @smci is right. This is most likely explained by "Missing document text when using python-docx". However, that answer does not provide a solution.
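To check whether that explanation applies to a given file, one could list the runs that are not direct children of a paragraph, since, as I understand the linked answer, those are the ones python-docx's paragraph.text can skip. The following is only a rough sketch using the standard library; find_wrapped_text is a hypothetical helper name, and the guess that the wrapper in this file is something like a w:smartTag is an assumption I have not verified.

```python
import zipfile
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def find_wrapped_text(docx_file):
    """Return (wrapper_tag, text) pairs for runs that are not direct children of a paragraph."""
    with zipfile.ZipFile(docx_file) as archive:
        root = XML(archive.read('word/document.xml'))
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    hits = []
    for run in root.iter(W + 'r'):
        parent = parent_of.get(run)
        if parent is not None and parent.tag != W + 'p':
            # Hypothetical case: a run nested in a wrapper such as w:smartTag.
            text = ''.join(t.text or '' for t in run.iter(W + 't'))
            if text:
                hits.append((parent.tag.replace(W, 'w:'), text))
    return hits
```

If the missing word shows up in the returned list, the wrapper tag next to it tells you which element is hiding it from paragraph.text.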
I think our only alternative in this case is to fall back to reading the XML directly. Consider, for instance, this (simplified) function from http://etienned.github.io/posts/extract-text-from-word-docx-simply/:
import io
import zipfile
from xml.etree.ElementTree import XML

import requests


def get_docx_text(path):
    """Take the path of (or a file-like object for) a docx file, return its text."""
    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'
    # A docx file is a zip archive; the body text lives in word/document.xml.
    with zipfile.ZipFile(path) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)
    paragraphs = []
    # iter() replaces getiterator(), which was removed in Python 3.9.
    for paragraph in tree.iter(PARA):
        # Collect every w:t under the paragraph, however deeply it is nested.
        texts = [n.text for n in paragraph.iter(TEXT) if n.text]
        if texts:
            paragraphs.append(''.join(texts))
    return '\n\n'.join(paragraphs)

url = 'https://github.com/python-openxml/python-docx/files/1996979/Delaware_Test.docx'
file = io.BytesIO(requests.get(url).content)
print(get_docx_text(file))
And we get:
APPLICABLE LAW
This Agreement is to be construed and interpreted according to the laws of the State of Delaware, excluding its conflict of laws provisions. The provisions of the U. N. Convention on Contracts for the International Sale of Goods shall not apply to this Agreement.
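Since the question mentions thousands of documents, the same idea can be applied to a whole folder. This is a minimal sketch under a few assumptions: the files sit in one local directory, docx_text and extract_folder are hypothetical helper names, and a file that cannot be read is simply recorded as None rather than stopping the run.

```python
import zipfile
from pathlib import Path
from xml.etree.ElementTree import XML, ParseError

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_text(docx_file):
    """Join every w:t descendant of each paragraph, whatever element wraps it."""
    with zipfile.ZipFile(docx_file) as archive:
        root = XML(archive.read('word/document.xml'))
    paragraphs = []
    for para in root.iter(W + 'p'):
        text = ''.join(t.text for t in para.iter(W + 't') if t.text)
        if text:
            paragraphs.append(text)
    return '\n\n'.join(paragraphs)


def extract_folder(folder):
    """Map each .docx file name in `folder` to its text; unreadable files map to None."""
    results = {}
    for path in sorted(Path(folder).glob('*.docx')):
        try:
            results[path.name] = docx_text(path)
        except (zipfile.BadZipFile, KeyError, ParseError):
            # Not a valid zip, no word/document.xml inside, or malformed XML.
            results[path.name] = None
    return results
```

Calling extract_folder('contracts') (a made-up directory name) would return a dict keyed by file name, which makes it easy to spot the files that failed.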