python xml docx python-docx

Splitting a docx by headings into separate files in Python


I want to write a program that takes my docx files, iterates through them, and splits each file into multiple separate files based on headings. Each docx contains several articles, each with a 'Heading 1' and text underneath it.

So if my original file1.docx has 4 articles, I want it to be split into 4 separate files each with its heading and text.

I got as far as iterating through all of the files in the directory where I keep the .docx files, and I can read the headings and the text separately, but I can't figure out how to pair each heading with its own text and write each pair to a separate file. I am using the python-docx library.

import glob
from docx import Document

headings = []
texts = []

def iter_headings(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Heading'):
            yield paragraph

def iter_text(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Normal'):
            yield paragraph

for name in glob.glob('/*.docx'):
    document = Document(name)
    for heading in iter_headings(document.paragraphs):
        headings.append(heading.text)
        for paragraph in iter_text(document.paragraphs):
            texts.append(paragraph.text)
    print(texts)

How do I extract the text and heading for each article?

This is the XML reading python-docx gives me. The red braces mark what I want to extract from each file.

https://user-images.githubusercontent.com/17858776/51575980-4dcd0200-1eac-11e9-95a8-f643f87b1f40.png

I am open to alternative suggestions for achieving this with different methods, or to hearing whether there is an easier way to do it with PDF files.


Solution

  • I think the approach of using iterators is a sound one, but I'd be inclined to parcel them differently. At the top level you could have:

    for paragraphs in iterate_document_sections(document):
        create_document_from_paragraphs(paragraphs)
    

    Then iterate_document_sections() would look something like:

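    def iterate_document_sections(document):
        """Generate a sequence of paragraphs for each headed section in document.

        Each generated sequence has a heading paragraph in its first position,
        followed by zero or more body paragraphs. This assumes the document is
        non-empty and that its first paragraph is a heading.
        """
        # Start the first section with the document's opening paragraph.
        paragraphs = [document.paragraphs[0]]
        for paragraph in document.paragraphs[1:]:
            if is_heading(paragraph):
                # A new heading closes the current section and starts the next.
                yield paragraphs
                paragraphs = [paragraph]
                continue
            paragraphs.append(paragraph)
        # Emit the final section, which has no following heading to close it.
        yield paragraphs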
    

    Something like this combined with portions of your other code should give you something workable to start with. You'll need an implementation of is_heading() and create_document_from_paragraphs().

    Note that the term "section" here is used as in common publishing parlance to refer to a (section) heading and its subordinate paragraphs, and does not refer to a Word document section object (like document.sections).
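
For illustration, here is one possible way to fill in the two missing helpers. This is a minimal sketch under several assumptions, not a definitive implementation: articles are marked with the built-in 'Heading 1' style, only plain paragraph text needs to be carried over (run-level formatting, images and tables are dropped), and the names safe_filename() and split_docx() are hypothetical helpers introduced here. The generator is restated in a slightly more tolerant form (it also copes with an empty document) so that the block runs on its own.

```python
import re
from pathlib import Path


def is_heading(paragraph, style_name="Heading 1"):
    """Return True if the paragraph uses the given heading style.

    Assumption: every article starts with a 'Heading 1' paragraph.
    """
    return paragraph.style.name == style_name


def iterate_document_sections(document):
    """Yield one list of paragraphs per headed section of the document."""
    paragraphs = []
    for paragraph in document.paragraphs:
        # A heading closes the section collected so far, if there is one.
        if is_heading(paragraph) and paragraphs:
            yield paragraphs
            paragraphs = []
        paragraphs.append(paragraph)
    if paragraphs:
        yield paragraphs


def safe_filename(text):
    """Turn heading text into a filesystem-friendly name (hypothetical helper)."""
    return re.sub(r"[^\w\-]+", "_", text).strip("_") or "untitled"


def create_document_from_paragraphs(paragraphs, output_path):
    """Write the paragraphs to a new .docx file, copying plain text only."""
    # Imported here so the pure-Python helpers above work without python-docx.
    from docx import Document

    new_document = Document()
    for paragraph in paragraphs:
        if is_heading(paragraph):
            new_document.add_heading(paragraph.text, level=1)
        else:
            new_document.add_paragraph(paragraph.text)
    new_document.save(str(output_path))


def split_docx(path, output_dir):
    """Split one .docx file into one file per section (hypothetical driver)."""
    from docx import Document

    source = Document(str(path))
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    sections = iterate_document_sections(source)
    for index, paragraphs in enumerate(sections, start=1):
        # The index keeps names unique even if two headings are identical.
        title = safe_filename(paragraphs[0].text)
        name = f"{Path(path).stem}_{index:02d}_{title}.docx"
        create_document_from_paragraphs(paragraphs, output_dir / name)
```

With these in place, the loop from the question could call split_docx(name, 'output') for each name returned by glob.glob(), where 'output' is a placeholder directory. If the original formatting has to survive, copying text alone will not be enough; in that case a different approach would be needed, such as working on the underlying XML.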