
Regex to compile all text between tabs/indents within paragraphs in Python


Apologies in advance if this is answered elsewhere; I've been looking, and I'm not good with regex. I'm using regex to extract sentences from a Word document containing paragraphs. Specifically, I need the text between two indents; alternatively, if someone can help me fix my current regex (shown later), that would also work. For example, from the following text:

Input Document

Here is the image as plain text, though I can't get the formatting the same:

  1. A method, comprising:

    storing a first data related to an operation style of a transport in a first area;

    storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and

    modifying functionality of the transport based on the combined energy consumption efficiency.

  2. The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws.

And here is the text that is actually read in from my function:

  1. A method, comprising:            storing a first data related to an operation style of a transport in a first area;             storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and             modifying functionality of the transport based on the combined energy consumption efficiency.2.     The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws.

All of this is output on one line when I print the text read in from the .docx file.

I need to extract the following lines:

storing a first data related to an operation style of a transport in a first area;

storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and

modifying functionality of the transport based on the combined energy consumption efficiency.

My current regex pattern is as follows:

pattern = re.compile(r"[ \t]+([^\s.;]+\s*)+[.;]+")

As mentioned before, if someone can help me fix this regex so that it reads up to either a semicolon or a period, that would be great. I understand that part of my problem is that I have [ \t] as opposed to just [\t]; however, when I remove the space, I get no output. Additionally, the current regex is supposed to read up to semicolons, but I now intend to read up to the next indent instead, so that I can parse the sentences afterwards and remove unnecessary information. If it helps at all, my current output is shown here:

Output

Here is just the raw text of output:

A method, comprising:            storing a first data related to an operation style of a transport in a first storing a second data related to an operation style of the transport in a second first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second and             modifying functionality of the transport based on the combined energy consumption method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular

Each line of text in the image is a single output from my code. Any text that isn't recognized from the original excerpt from the .docx is simply more of the text within the .docx file.

Finally, here is the code I'm currently working with:

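import re


def find_matches(text):
    print(text)
    pattern = re.compile(r"[ \t]+([^\s.;]+\s*)+[.;]+")
    # capitalize() is a helper that is not shown in the question
    return capitalize([m.group() for m in re.finditer(pattern, text)])


ct = 0  # match counter (assumed to start at zero; its initialisation is not shown)
for match in find_matches(text=docText):
    ct += 1
    # Drop the last space-separated word of each match before printing
    match_words = match.split(" ")
    match = " ".join(match_words[:-1])
    print(match)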

So, all I need is a regex that reads from indent to indent. Again, apologies if this is answered elsewhere; I simply could not find it.

I'm adding this bit as I've finally got some output with a regex pattern; however, it all seems to be gibberish, which I'm assuming is because of the encoding. Here is the code that produces it:

doc = open('P.docx', mode='r', encoding = "ISO-8859-1")
docText = doc.read()
pattern = r"^[^.;]*\s{2,}([^\s.;]*(?:\s+[^\s.;]+)+[.;])"
print(re.findall(pattern, docText, re.MULTILINE))

And this is just a bit (because there is a lot) of the output that I get from using this:

'½ú\x04Ü\x13\x8eÕ\nõ+;', '\x7fîÙ(\x11\x90\x85íÆ\x83Bs\x15Ü\xa0g\x03i\x00a\x070§¬gÃo\x18Ë\x9a\x81i[¡\x8eÃ{\x96FÃ9\x9f\x8aãð6°AÏ>ö·\x98+\x80e·!f\x8d\x0e{\x12W\x1eéÝ}iûͨ½niü>Ú¶mB¥»\tÜÀªÓÿº$í}b^3¢¡7\t\x1amwR\x19ò\x96\x83"Hf\x0fòÑ«NÀ=áXÝP½²£ç\x1a\x01ZÁÍEÃÌ4ÒÄ\x90-dÌìáy½Þ|yFÕ,4ýÂÍ.', "ð`\x9c\n\x99´-Á:bÒÒY²O\x86\x88\x06'\x93°Îx4û§'?Ì÷\xad\x00m{N¸r6a\x86×8Û\x9drâúÙÄ9\x85\x91\x0c-;",
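
A note on the gibberish above: it is probably not an encoding problem. A .docx file is a ZIP archive of XML parts, so opening it in text mode returns compressed bytes rather than the document text. Below is a minimal sketch, using only the standard library, that reads the main document part and returns one string per paragraph. The function name `read_docx_paragraphs` is hypothetical, and the sketch assumes that the visible text lives in `w:t` elements inside runs and that tab characters are stored as `w:tab` elements inside runs; it ignores headers, footnotes, and other parts.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p, w:r, w:t and w:tab
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(path_or_file):
    """Return the text of each paragraph in a .docx file, one string per paragraph."""
    # A .docx file is a ZIP archive; the body text is in word/document.xml
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        parts = []
        # Only look at direct children of runs, so tab-stop definitions
        # in the paragraph properties are not mistaken for tab characters
        for run in para.iter(W_NS + "r"):
            for child in run:
                if child.tag == W_NS + "t":
                    parts.append(child.text or "")
                elif child.tag == W_NS + "tab":
                    parts.append("\t")
        paragraphs.append("".join(parts))
    return paragraphs
```

Calling `read_docx_paragraphs('P.docx')` would then give a list of paragraph strings that can be joined with newlines before applying a regex. The third-party python-docx package offers a similar, more complete way to read paragraph text, if installing a dependency is acceptable.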


Solution

  • You might start the match with 1 or more spaces or tabs, and capture what you want in a group.

    ^[ \t]+([^\s.;]+(?:\s+[^\s.;]+)*[.;])
    
    • ^ Start of the string, or of each line when re.MULTILINE is used (as in the example)
    • [ \t]+ Match 1+ tabs or spaces
    • ( Capture group 1
      • [^\s.;]+ Match 1+ non whitespace chars except . or ;
      • (?:\s+[^\s.;]+)* Optionally repeat matching 1+ whitespace chars and 1+ non whitespace chars except . or ;
      • [.;] Match either . or ;
    • ) Close group 1


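    Example (this variant uses {2,} to require at least two leading spaces or tabs, and + instead of * so that a match contains at least two words)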

    import re
    from pprint import pprint
    pattern = r"^[ \t]{2,}([^\s.;]+(?:\s+[^\s.;]+)+[.;])"
    
    s = ("1. A method, comprising:\n\n"
         "      storing a first data related to an operation style of a transport in a first area;\n\n"
         "     storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and \n\n"
         "     modifying functionality of the transport based on the combined energy consumption efficiency.\n\n"
         "2. The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws. \n\n"
         "And here is the text that is actually read in from my function:\n\n"
         "> 1. A method, comprising:            storing a first data related to an operation style of a transport in a first area;             storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and             modifying functionality of the transport based on the combined energy consumption efficiency.2.     The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws.\n")
    
    result = re.findall(pattern, s, re.MULTILINE)
    pprint(result, width=100)
    

    Output

    ['storing a first data related to an operation style of a transport in a first area;',
     'storing a second data related to an operation style of the transport in a second area;',
     'modifying functionality of the transport based on the combined energy consumption efficiency.']
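
The pattern above relies on each clause starting on its own line. For the flattened, single-line text shown in the question, one possible alternative is to split on the indents themselves. This is a minimal sketch, assuming that every indent appears as a run of two or more spaces or tabs (as it does in the text read in by the asker); the function name `split_on_indents` and the shortened sample string are made up for illustration.

```python
import re


def split_on_indents(text):
    """Split flattened text wherever two or more spaces or tabs occur in a row."""
    chunks = re.split(r"[ \t]{2,}", text)
    # Trim each piece and drop empty pieces left by leading or trailing indents
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Shortened stand-in for the single-line text read from the document
sample = ("1. A method, comprising:            storing a first data;"
          "             storing a second data; and"
          "             modifying functionality.")

for chunk in split_on_indents(sample):
    print(chunk)
```

Each chunk is the text between two indents, so the first chunk is the claim header and can be discarded if only the indented clauses are wanted. A single tab between clauses would not be treated as an indent by this sketch; changing the quantifier is one way to handle that case.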