I want to write a program that grabs my .docx files, iterates through them, and splits each file into multiple separate files based on headings. Each .docx contains several articles, each with a 'Heading 1' and text underneath it.
So if my original file1.docx has 4 articles, I want it to be split into 4 separate files each with its heading and text.
I got as far as iterating through all of the files in the path where I keep the .docx files, and I can read the headings and the text separately, but I can't figure out how to combine them and write each article (heading plus text) to its own file. I am using the python-docx library.
import glob
from docx import Document

headings = []
texts = []

def iter_headings(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Heading'):
            yield paragraph

def iter_text(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Normal'):
            yield paragraph

for name in glob.glob('/*.docx'):
    document = Document(name)
    for heading in iter_headings(document.paragraphs):
        headings.append(heading.text)
    for paragraph in iter_text(document.paragraphs):
        texts.append(paragraph.text)
print(texts)
How do I extract the text and heading for each article?
This is the XML reading python-docx gives me. The red braces mark what I want to extract from each file.
https://user-images.githubusercontent.com/17858776/51575980-4dcd0200-1eac-11e9-95a8-f643f87b1f40.png
I am open to alternative suggestions on how to achieve this with different methods, or to hearing whether there is an easier way to do it with PDF files.
I think the approach of using iterators is a sound one, but I'd be inclined to parcel them differently. At the top level you could have:
for paragraphs in iterate_document_sections(document):
    create_document_from_paragraphs(paragraphs)
Then iterate_document_sections() would look something like:
def iterate_document_sections(document):
    """Generate a sequence of paragraphs for each headed section in document.

    Each generated sequence has a heading paragraph in its first position,
    followed by one or more body paragraphs.
    """
    paragraphs = [document.paragraphs[0]]
    for paragraph in document.paragraphs[1:]:
        if is_heading(paragraph):
            yield paragraphs
            paragraphs = [paragraph]
            continue
        paragraphs.append(paragraph)
    yield paragraphs
Something like this combined with portions of your other code should give you something workable to start with. You'll need an implementation of is_heading() and create_document_from_paragraphs().
Note that the term "section" here is used as in common publishing parlance to refer to a (section) heading and its subordinate paragraphs, and does not refer to a Word document section object (like document.sections).
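For the two helpers left unimplemented, here is one possible sketch rather than a definitive version. It assumes that only 'Heading 1' paragraphs start a new article, that copying plain paragraph text is enough, and that each output file can be named after its heading. The safe_filename() helper and the output_dir parameter are hypothetical additions for illustration and are not part of python-docx.

```python
import os
import re


def is_heading(paragraph):
    """Return True for a paragraph styled 'Heading 1' (assumed article title)."""
    return paragraph.style.name == "Heading 1"


def safe_filename(text):
    """Turn heading text into a file name (hypothetical helper)."""
    # Replace anything that is not a word character, hyphen, or space.
    cleaned = re.sub(r"[^\w\- ]", "_", text).strip()
    return (cleaned or "untitled") + ".docx"


def create_document_from_paragraphs(paragraphs, output_dir="."):
    """Save one article as its own .docx file and return the path written.

    Only paragraph text is copied; run-level formatting (bold, italics),
    images, and tables are not carried over.
    """
    # Imported here so the two helpers above work without python-docx.
    from docx import Document

    # The first paragraph of each group is treated as the heading.
    heading, *body = paragraphs
    new_document = Document()
    new_document.add_heading(heading.text, level=1)
    for paragraph in body:
        new_document.add_paragraph(paragraph.text)

    path = os.path.join(output_dir, safe_filename(heading.text))
    new_document.save(path)
    return path
```

Two caveats with this sketch: if a file has text before its first heading, that first group's opening paragraph is treated as the heading, and two articles with the same heading text would overwrite each other, so you may want to add a counter or the source file name to the output name.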