Headers and footers in python-docx

I want to read the header and footer text of a docx file in Python. I am using the python-docx module.

I found this documentation - http://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html

But I do not think it has been implemented yet. I also see that there is a "feature-headers" branch on GitHub for python-docx - https://github.com/danmilon/python-docx/tree/feature-headers

It seems this feature never made it into the master branch. Has anyone used it? Can you help me with how to use it?

Thank you very much.

Solution

There is a better solution to this problem:

Method used to extract

Using the Word document's underlying XML

A .docx file is already a zip archive, so there is nothing to zip yourself. Open it with the zipfile module to get access to the XML parts of the Word document, then use simple XML node extraction to pull out the text.
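To see which parts a particular file actually contains, it can help to list the archive first. This is only a minimal sketch using the standard zipfile module; the function name list_docx_parts is made up for illustration.

```python
import zipfile


def list_docx_parts(path):
    """Return the sorted names of all parts stored in a .docx archive."""
    # A .docx file is a zip archive, so ZipFile can open it directly.
    with zipfile.ZipFile(path) as document:
        return sorted(document.namelist())
```

Header and footer parts normally show up as numbered entries under word/ (for example word/header1.xml or word/footer2.xml). The numbers are not the same in every document, so it is worth checking them before relying on a fixed name.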

The following code extracts the header, footer, and body text from a docx file.

# xml.etree.cElementTree was removed in Python 3.9; ElementTree already
# uses the C accelerator when it is available.
from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the header, body
    and footer text as a single unicode string.
    """
    paragraphs = []
    with zipfile.ZipFile(path) as document:
        names = document.namelist()
        # Header and footer parts are numbered (header1.xml, footer2.xml, ...)
        # and the numbers differ between documents, so look them up instead
        # of hard-coding "header2.xml" and "footer2.xml".
        headers = sorted(n for n in names
                         if n.startswith('word/header') and n.endswith('.xml'))
        footers = sorted(n for n in names
                         if n.startswith('word/footer') and n.endswith('.xml'))
        parts = ([(n, "Header : ") for n in headers]
                 + [('word/document.xml', "")]
                 + [(n, "Footer : ") for n in footers])

        for name, label in parts:
            tree = XML(document.read(name))
            # Element.getiterator() was removed in Python 3.9; use iter().
            for paragraph in tree.iter(PARA):
                texts = [node.text
                         for node in paragraph.iter(TEXT)
                         if node.text]
                if texts:
                    paragraphs.append(label + ''.join(texts))
    return '\n\n'.join(paragraphs)


print(get_docx_text("E:\\path_to.docx"))
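
The numbered part names do not say which header is the default, first-page, or even-page one. As a rough sketch, assuming the usual layout where word/document.xml holds w:headerReference and w:footerReference elements and word/_rels/document.xml.rels maps their r:id values to part names, the mapping could be recovered like this. The function name header_footer_types is made up for illustration, and relationship targets are assumed to be relative to word/.

```python
import zipfile
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
R = '{http://schemas.openxmlformats.org/officeDocument/2006/relationships}'
REL = '{http://schemas.openxmlformats.org/package/2006/relationships}'


def header_footer_types(path):
    """Map each referenced header/footer part to a (kind, type) tuple."""
    with zipfile.ZipFile(path) as document:
        rels = XML(document.read('word/_rels/document.xml.rels'))
        body = XML(document.read('word/document.xml'))

    # Relationship ids (rId1, rId2, ...) point at part names such as header1.xml.
    targets = {rel.get('Id'): rel.get('Target')
               for rel in rels.iter(REL + 'Relationship')}

    result = {}
    for kind in ('header', 'footer'):
        # Section properties reference headers and footers by relationship id.
        for ref in body.iter(W + kind + 'Reference'):
            target = targets.get(ref.get(R + 'id'))
            if target:
                # Assumption: the target is relative to the word/ folder.
                result['word/' + target] = (kind, ref.get(W + 'type'))
    return result
```

The type value is whatever the document stores in w:type, typically default, first, or even. Parts that no section references are left out of the result.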