I want to read header and footer text for a docx file in Python. I am using python-docx module.
I found this documentation - http://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html
But I do not think it has been implemented yet. I also see that there is a "feature-headers" branch in github for python-docx - https://github.com/danmilon/python-docx/tree/feature-headers
Seems like this feature never got into master branch. Anyone used this feature? Can you help me on how to use it?
Thank you very much.
There is a better solution to this problem :
Method Used to extract
using MS XML Word document
just zip the word document using zip module, It will give you access to xml format of word document, then you can use simple xml node extraction for text.
Following is the working code that extracts Header, Footer, Text Data from a docx file.
try:
from xml.etree.cElementTree import XML
except ImportError:
from xml.etree.ElementTree import XML
import zipfile
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
def get_docx_text(path):
"""
Take the path of a docx file as argument, return the text in unicode.
"""
document = zipfile.ZipFile(path)
contentToRead = ["header2.xml", "document.xml", "footer2.xml"]
paragraphs = []
for xmlfile in contentToRead:
xml_content = document.read('word/{}'.format(xmlfile))
tree = XML(xml_content)
for paragraph in tree.getiterator(PARA):
texts = [node.text
for node in paragraph.getiterator(TEXT)
if node.text]
if texts:
textData = ''.join(texts)
if xmlfile == "footer2.xml":
extractedTxt = "Footer : " + textData
elif xmlfile == "header2.xml":
extractedTxt = "Header : " + textData
else:
extractedTxt = textData
paragraphs.append(extractedTxt)
document.close()
return '\n\n'.join(paragraphs)
print(get_docx_text("E:\\path_to.docx"))