
How to extract the text under headings in a docx file using Python


I want to extract the text under each heading in a docx file. The text structure looks roughly like this:

1. DESCRIPTION
  Some text here

2. TERMS AND SERVICES
 2.1 Some text here
 2.2 Some text here

3. PAYMENTS AND FEES
  Some text here

What I am looking for is something like this:

['1. DESCRIPTION','Some text here']
['2. TERMS AND SERVICES','2.1 Some text here 2.2 Some text here']
['3. PAYMENTS AND FEES', 'Some text here']

I have tried using the python-docx library:

from docx import Document

document = Document('Test.docx')


def iter_headings(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Normal'):
            yield paragraph


for heading in iter_headings(document.paragraphs):
    print(heading.text)

The styles in my document vary between Normal, Body Text and Heading #. For example, sometimes the headings are in Normal style and the text of that section is in Body Text style. Can someone please point me in the right direction? I would really appreciate it.


Solution

  • There is a way to do this.

    After extracting the contents, also treat as headings the paragraphs that have the "Normal" style and are entirely bold. Apply this rule carefully so that bold text inside an ordinary paragraph, used only to highlight an important term, is not mistaken for a heading.

    To do this, scan each paragraph and iterate over all of its runs to check whether every run is bold. If all the runs of a "Normal" paragraph are bold, you can conclude that it is a heading.

    To apply this logic, you can use the code below while iterating over the paragraphs of your document:

    from docx import Document

    document = Document('Test.docx')

    # Iterate over the paragraphs
    for paragraph in document.paragraphs:

        # Apply the check only to paragraphs whose own style is not already a "Heading" style
        if "Heading" not in paragraph.style.name:

            # Start by initializing an empty string to collect the bold text of the runs
            runboldtext = ''

            # Iterate over all runs of the current paragraph and collect the bold text in "runboldtext".
            # run.bold is None when boldness is inherited from the style, so only runs explicitly marked bold are counted.
            for run in paragraph.runs:
                if run.bold:
                    runboldtext = runboldtext + run.text

            # If "runboldtext" equals the entire paragraph text, every word in the paragraph is bold and it can be treated as a heading
            if runboldtext == str(paragraph.text) and runboldtext != '':
                print("Heading True for the paragraph: ", runboldtext)
                style_of_current_paragraph = 'Heading'
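The code above only detects the headings. To get the [heading, text] pairs asked for in the question, the body paragraphs still have to be collected under the most recent heading. Below is a minimal sketch of one way to do that, not a definitive implementation. The helper names `is_heading`, `group_by_heading` and `sections_from_file` are made up for this example, and it assumes that a heading is either a paragraph with a "Heading" style or a paragraph whose runs are all explicitly bold.

```python
def is_heading(paragraph):
    """Return True for a "Heading"-styled paragraph or an entirely bold one."""
    if "Heading" in paragraph.style.name:
        return True
    # Join the text of the runs explicitly marked bold. run.bold is None when
    # the setting is inherited from the style; such runs are not counted here.
    bold_text = "".join(run.text for run in paragraph.runs if run.bold)
    return bold_text != "" and bold_text == paragraph.text


def group_by_heading(paragraphs):
    """Return a list of [heading_text, body_text] pairs in document order."""
    sections = []
    for paragraph in paragraphs:
        text = paragraph.text.strip()
        if not text:
            continue  # skip empty paragraphs
        if is_heading(paragraph):
            sections.append([text, []])       # start a new section
        elif sections:
            sections[-1][1].append(text)      # body text of the current section
        # Text that appears before the first heading is ignored in this sketch.
    # Join the body paragraphs of each section with a single space.
    return [[heading, " ".join(body)] for heading, body in sections]


def sections_from_file(path):
    """Hypothetical convenience wrapper: load a .docx file and group its paragraphs."""
    from docx import Document  # python-docx, imported here so the helpers above stay library-agnostic
    return group_by_heading(Document(path).paragraphs)
```

Usage would look like `for section in sections_from_file('Test.docx'): print(section)`. If some headings in your document are bold only through their style rather than through the runs themselves, the all-bold check will miss them, and you would need to extend `is_heading` for that case.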