Tags: python, regex, python-imaging-library, docx, python-re

How do I extract specific lines from a string starting from a keyword and ending at a different keyword in python?


The goal of my code is to take text from a Word document and, for every occurrence of a keyword, extract the text from that keyword up to its associated part number. For example:

The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component,

Would become:

detecting, by a component in a transport, that another component has been removed 244C

In addition to this, I need to take that text, and center it within an image that I've created with my code. Here is my code:

import re
import time
import textwrap
from docx import Document
from PIL import Image, ImageFont, ImageDraw

doc = Document('PatentDocument.docx')
docText = ''.join(paragraph.text for paragraph in doc.paragraphs)
print(docText)

for i, p in enumerate(docText):
    W, H = 300, 300
    body = Image.new('RGB', (W, H), (255, 255, 255))
    border = Image.new('RGB', (W + 2, H + 2), (0, 0, 0))
    border.save('border.png')
    body.save('body.png')
    patent = Image.open('border.png')
    patent.paste(body, (1, 1))
    draw = ImageDraw.Draw(patent)
    font = ImageFont.load_default()

    current_h, pad = 60, 20
    keywords = ['responsive', 'detecting', 'providing', 'Responsive', 'Detecting', 'Providing']
    pattern = re.compile('|'.join(keywords))
    parts = re.findall("\d{1,3}[C]", docText)
    print(parts)
    for keywords in textwrap.wrap(docText, width=50):
        line = keywords.encode('utf-8')
        w, h = draw.textsize(line, font=font)
        draw.text(((W-w)/2, current_h), line, (0, 0, 0), font=font)
        current_h += h + pad

    patent.save(f'patent_{i+1}_{time.strftime("%Y%m%d%H%M%S")}.png')

What my code currently does is print the string containing the entire text of the Word document, and output an image of the entire text 500+ times, which is the character count of the string. Here is an example of one of my outputs:

Example Output

This output is repeated 500+ times. In addition to that, these get output in the run window:

[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C. ['244C', '246C', '248C', '249C']

Except, that array that followed the paragraph is repeated 500+ times as well.
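
The repetition appears to come from the outer loop: `enumerate(docText)` walks the string one character at a time, so the loop body (including the image save and the `print(parts)` call) runs once per character. A minimal sketch of the difference, using a shortened stand-in for the paragraph (the `sample` string and the variable names are illustrative, not from the original code):

```python
import re

# Shortened stand-in for docText (illustrative, not the full paragraph).
sample = "detecting, by a component, that another component has been removed 244C, detecting 246C"

# Iterating over a str yields single characters, so a loop over
# enumerate(sample) runs len(sample) times.
char_iterations = sum(1 for _ in enumerate(sample))

# Iterating over the regex matches runs once per part number instead.
parts = re.findall(r"\d{1,3}C", sample)

print(char_iterations == len(sample))  # True
print(parts)  # ['244C', '246C']
```

Looping over the extracted matches rather than over `docText` itself would give one iteration (and one image) per phrase.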

This is the word document that I'm reading from and converting into a single string:

[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.

I currently want to know how to extract the specific lines from the string I made. The output should look like this--ignoring the boxes and the centering--I'm only looking to output those lines from the paragraph I gave:

Desired Out

Some pseudo code for this would be something like:

for keyword in docText:
     print({keyword, part number})

My current implementation is with docx, PIL and re, though I'm happy to use anything that will accomplish my goals. Anything helps!


Solution

  • So, after some help from an outside source I managed to get it all sorted out. Minus the code for outputting to images with centered text and all that, this is the code that works to solve my main issue:

    import re

    from docx import Document
    from PIL import Image, ImageFont, ImageDraw  # only needed for the image step

    # Join every paragraph of the Word document into one string.
    doc = Document('PatentDocument.docx')
    docText = ''.join(paragraph.text for paragraph in doc.paragraphs)
    print(docText)


    def get(source, begin, end):
        """Return the text between the first `begin` and the next `end`, or ""."""
        try:
            start = source.index(begin) + len(begin)
            finish = source.index(end, start)
            return source[start:finish]
        except ValueError:
            return ""


    def create_regex(keywords=('responsive', 'providing', 'detecting')):
        # Builds e.g. ([Rr]esponsive|[Pp]roviding|[Dd]etecting).*?(\d{1,3}C):
        # a keyword, then the shortest run of text up to the next part number.
        regex = (
            "("
            + "|".join((f"[{k[0].upper()}{k[0].lower()}]{k[1:]}" for k in keywords))
            + ")"
            + ".*?(\\d{1,3}C)"
        )
        return re.compile(regex)


    def find_matches(text, keywords):
        # m.group() is the whole phrase, from the keyword through the part number.
        return [m.group() for m in re.finditer(create_regex(keywords), text)]


    for match in find_matches(
        text=docText, keywords=("responsive", "detecting", "providing")
    ):
        print(match)
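
As a side note on the pattern itself: `.*?` is non-greedy, so each match stops at the first part number after a keyword, and the two capture groups hold the keyword and the part number. A small sketch that returns them as pairs, in the spirit of the pseudocode in the question, might look like this. The `extract_pairs` helper and the inline `text` sample are assumptions for illustration, and `re.escape` is an addition in case a keyword ever contains regex metacharacters.

```python
import re

def extract_pairs(text, keywords=("responsive", "detecting", "providing")):
    """Return (keyword, part_number, phrase) for each keyword-to-part-number span."""
    # Match either capitalization of the first letter of each keyword.
    alternatives = "|".join(
        f"[{k[0].upper()}{k[0].lower()}]{re.escape(k[1:])}" for k in keywords
    )
    pattern = re.compile(rf"({alternatives}).*?(\d{{1,3}}C)")
    return [(m.group(1), m.group(2), m.group(0)) for m in pattern.finditer(text)]

# Shortened excerpt of the paragraph from the question.
text = (
    "The processor 204 performs one or more of detecting, by a component in a "
    "transport, that another component has been removed 244C, detecting, by the "
    "component, that a replacement component has been added in the transport 246C"
)

for keyword, part, phrase in extract_pairs(text):
    print(keyword, part)
```

Note that `.` does not match newlines by default; that is not an issue here because the paragraphs are joined into a single line of text.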
    

    So, from the source document:

    [0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.

    I get the following output:

    [0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.

    detecting, by a component in a transport, that another component has been removed 244C

    detecting, by the component, that a replacement component has been added in the transport 246C

    providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C

    responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C

    The printed paragraph and the keyword strings that follow it actually come out with no blank lines between them; I've separated them here for ease of reading. Hope this can help someone else out!
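
For the image half of the question, a rough sketch of centering wrapped text follows. It assumes a recent Pillow release (10 or later), where `draw.textsize` from the question's code is no longer available and `draw.textbbox` is used for measuring instead. The helper names `centered_positions` and `render_phrase`, the 40-character wrap width, and the 1-pixel border via `ImageOps.expand` are illustrative choices, not part of the original code.

```python
import textwrap

def centered_positions(sizes, canvas_width, top=60, pad=20):
    """Given (width, height) per line, return the (x, y) that centers each line horizontally."""
    positions = []
    y = top
    for w, h in sizes:
        positions.append(((canvas_width - w) / 2, y))
        y += h + pad
    return positions

def render_phrase(phrase, path, size=(300, 300), wrap_width=40):
    """Draw one extracted phrase, wrapped and centered, on a bordered white image."""
    # Imported here so the layout helper above works without Pillow installed.
    from PIL import Image, ImageDraw, ImageFont, ImageOps

    width, _height = size
    image = Image.new("RGB", size, (255, 255, 255))
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()

    # Measure each wrapped line with textbbox (left, top, right, bottom).
    lines = textwrap.wrap(phrase, width=wrap_width)
    sizes = []
    for line in lines:
        left, top, right, bottom = draw.textbbox((0, 0), line, font=font)
        sizes.append((right - left, bottom - top))

    for line, (x, y) in zip(lines, centered_positions(sizes, width)):
        draw.text((x, y), line, fill=(0, 0, 0), font=font)

    # A 1-pixel black border replaces the separate border.png / body.png files.
    ImageOps.expand(image, border=1, fill=(0, 0, 0)).save(path)
```

Calling `render_phrase(match, f"patent_{i + 1}.png")` inside a loop over `enumerate(find_matches(...))` would then produce one image per extracted phrase rather than one per character.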