
Extract paragraph text in Python


How can I search a Word document using Python and extract the paragraph text that follows a matching heading, e.g. "1.2 Summary of Broadspectrum Offer"?

See below for an example document. I would basically like to get the following text: "A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein. Please also find the cost breakdown "

1.  Executive Summary

1.1 Summary of Services
Energy Savings (Carbon Emissions and Intensity Reduction)
Upgrade Economy Cycle on Level 2,5,6,7 & 8, replace Chilled Water Valves on Level 6 & 8 and install lighting controls on L5 & 6..

1.2 Summary of Broadspectrum Offer

A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown 

Note that the heading numbers change from document to document, so I do not want to rely on them. Instead, I want to rely on the search text in the heading.

So far I can search the documents, but this is just a start:

filename1 = "North Sydney TE SP30062590-1 HVAC - Project Offer -  Rev1.docx"

from docx import Document

document = Document(filename1)
for paragraph in document.paragraphs:
    if 'Summary' in paragraph.text:
        print(paragraph.text)

Solution

  • Here's a preliminary solution (pending answers to my comments on your post above). It does not yet exclude the paragraphs that follow the Summary of Broadspectrum Offer section. If you need that, you will most likely need a small regex match to detect the next header section (1.3, etc.) and stop the loop there. Let me know if this is a requirement.

    Edit: converted the print() call from a list comprehension to a standard for loop, in response to Anton vBR's comment below.

    from docx import Document
    
    document = Document("North Sydney TE SP30062590-1 HVAC - Project Offer -  Rev1.docx")
    
    # Find the index of every paragraph containing the `Summary of Broadspectrum Offer` heading text
    ind = [i for i, para in enumerate(document.paragraphs) if 'Summary of Broadspectrum Offer' in para.text]
    # Print the text of every paragraph that comes after the first match
    if ind:
        for i, para in enumerate(document.paragraphs):
            if i > ind[0]:
                print(para.text)

    >> A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. 
    Please refer to the various terms and conditions of our Offer as detailed herein.
    Please also find the cost breakdown 
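
If the paragraphs after the section do need to be excluded, the regex idea mentioned above could look roughly like this. It is only a sketch: `section_text` and `HEADING_RE` are hypothetical names, and it assumes the heading numbers ("1.2", "1.3", ...) are typed into the paragraph text. If Word generates the numbers through automatic list numbering, they will most likely not appear in the text and this check will not fire. The function takes plain strings so it can be tried without a document.

```python
import re

# Assumption: a heading starts with a dotted number such as "1.", "1.2" or "2.3.1"
# followed by whitespace and some text.
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S")


def section_text(paragraph_texts, heading_phrase):
    """Return the non-empty paragraphs between the heading that contains
    `heading_phrase` and the next numbered heading."""
    collected = []
    in_section = False
    for text in paragraph_texts:
        stripped = text.strip()
        if not in_section:
            # Keep scanning until the heading we are searching for shows up
            if heading_phrase in stripped:
                in_section = True
            continue
        if HEADING_RE.match(stripped):
            break  # the next numbered heading ends the section
        if stripped:
            collected.append(stripped)
    return collected
```

With python-docx this would be called as `section_text([p.text for p in document.paragraphs], 'Summary of Broadspectrum Offer')`. A body paragraph that happens to start with a number would also end the section early, so the pattern may need tightening for real documents.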
    

    Also, here is another post that may help with an alternative approach, which is to detect headings using paragraph metadata: Extracting headings' text from word doc
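
A rough sketch of that metadata approach, assuming the headings in the document use Word's built-in heading styles (whose names start with "Heading", e.g. "Heading 2"). `section_by_style` is a hypothetical helper, not part of python-docx; it only relies on each paragraph having `.text` and `.style.name`, which python-docx paragraphs provide. The demo uses stand-in objects so it runs without a .docx file.

```python
from types import SimpleNamespace


def section_by_style(paragraphs, heading_phrase):
    """Collect the non-empty body paragraphs under the heading containing
    `heading_phrase`, stopping at the next heading-styled paragraph."""
    collected = []
    in_section = False
    for para in paragraphs:
        # Assumption: headings are formatted with a "Heading N" style
        is_heading = para.style.name.startswith("Heading")
        if in_section:
            if is_heading:
                break  # the next heading ends the section
            if para.text.strip():
                collected.append(para.text.strip())
        elif is_heading and heading_phrase in para.text:
            in_section = True
    return collected


def _fake(text, style_name):
    # Stand-in for a python-docx paragraph, for demonstration only
    return SimpleNamespace(text=text, style=SimpleNamespace(name=style_name))


demo = [
    _fake("Summary of Services", "Heading 2"),
    _fake("Energy Savings", "Normal"),
    _fake("Summary of Broadspectrum Offer", "Heading 2"),
    _fake("A summary of our Offer.", "Normal"),
    _fake("Please also find the cost breakdown ", "Normal"),
    _fake("Pricing", "Heading 2"),
    _fake("Not part of the section", "Normal"),
]
result = section_by_style(demo, "Summary of Broadspectrum Offer")
```

For a real file, pass `Document(filename1).paragraphs` instead of `demo`. Because this checks the style rather than the text, it does not depend on the heading numbers at all, but it will find nothing if the headings are just bold "Normal" paragraphs.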