Search code examples
python-docx

headers and footers in python-docx


I want to read header and footer text for a docx file in Python. I am using python-docx module.

I found this documentation - http://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html

But I do not think it has been implemented yet. I also see that there is a "feature-headers" branch in github for python-docx - https://github.com/danmilon/python-docx/tree/feature-headers

Seems like this feature never got into master branch. Anyone used this feature? Can you help me on how to use it?

Thank you very much.


Solution

  • There is a better solution to this problem :

    Method Used to extract

    using MS XML Word document

    just zip the word document using zip module, It will give you access to xml format of word document, then you can use simple xml node extraction for text.

    Following is the working code that extracts Header, Footer, Text Data from a docx file.

    try:
        from xml.etree.cElementTree import XML
    except ImportError:
        from xml.etree.ElementTree import XML
    import zipfile
    
    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'
    
    
    def get_docx_text(path):
        """
        Take the path of a docx file as argument, return the text in unicode.
        """
        document = zipfile.ZipFile(path)
        contentToRead = ["header2.xml", "document.xml", "footer2.xml"]
        paragraphs = []
    
        for xmlfile in contentToRead:
            xml_content = document.read('word/{}'.format(xmlfile))
            tree = XML(xml_content)
            for paragraph in tree.getiterator(PARA):
                texts = [node.text
                         for node in paragraph.getiterator(TEXT)
                         if node.text]
                if texts:
                    textData = ''.join(texts)
                    if xmlfile == "footer2.xml":
                        extractedTxt = "Footer : " + textData
                    elif xmlfile == "header2.xml":
                        extractedTxt = "Header : " + textData
                    else:
                        extractedTxt = textData
    
                    paragraphs.append(extractedTxt)
        document.close()
        return '\n\n'.join(paragraphs)
    
    
    print(get_docx_text("E:\\path_to.docx"))