Apologies in advance if this has been answered elsewhere; I've been looking, and I'm not good with regex. I'm using regex to extract sentences from a Word document containing paragraphs. Specifically, I need the text between two indents; alternatively, if someone can help me fix my current regex (shown later), that will also work. For example, from the following text:
Here is the document as plain text, though I can't reproduce the formatting exactly:
1. A method, comprising:
    storing a first data related to an operation style of a transport in a first area;
    storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and
    modifying functionality of the transport based on the combined energy consumption efficiency.
2. The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws.
And here is the text that is actually read in from my function:
1. A method, comprising: storing a first data related to an operation style of a transport in a first area; storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and modifying functionality of the transport based on the combined energy consumption efficiency.2. The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws.
All of this comes out on a single line when I print the text read in from the .docx file.
I need to extract the following lines:
storing a first data related to an operation style of a transport in a first area;
storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and
modifying functionality of the transport based on the combined energy consumption efficiency.
My current regex pattern is as follows:
pattern = re.compile(r"[ \t]+([^\s.;]+\s*)+[.;]+")
As mentioned before, if someone can help me fix this regex so that it reads up to either a semicolon or a period, that would be great. I understand that part of my problem is that I have [ \t] rather than just [\t]; however, when I remove the space, I get no output. Additionally, the current regex is supposed to read up to semicolons, but I would now rather read up to the next indent, so that I can parse the sentences afterwards and remove unnecessary information. If it helps at all, my current output is shown here:
Here is just the raw text of output:
A method, comprising: storing a first data related to an operation style of a transport in a first
storing a second data related to an operation style of the transport in a second
first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second
and modifying functionality of the transport based on the combined energy consumption
method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular
Each line above is a single output from my code. Any text that isn't recognized from the original excerpt from the .docx is simply more of the text within the .docx file.
Finally, here is the code I'm currently working with:
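import re

def find_matches(text):
    print(text)
    pattern = re.compile(r"[ \t]+([^\s.;]+\s*)+[.;]+")
    # capitalize() is defined elsewhere (not shown)
    return capitalize([m.group() for m in re.finditer(pattern, text)])

ct = 0  # match counter (initial value assumed; not shown in the original snippet)
for match in find_matches(text=docText):
    ct += 1
    match_words = match.split(" ")
    match = " ".join(match_words[:-1])  # drop the last word of each match
    print(match)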
So, all I need is a regex that reads from indent to indent. Again, apologies if this has been answered elsewhere; I simply could not find it.
I'm adding this update because I've finally got some output with a regex pattern; however, it all seems to be gibberish, which I assume is because of the encoding. Here is the code that produces it:
import re

doc = open('P.docx', mode='r', encoding="ISO-8859-1")
docText = doc.read()
pattern = r"^[^.;]*\s{2,}([^\s.;]*(?:\s+[^\s.;]+)+[.;])"
print(re.findall(pattern, docText, re.MULTILINE))
And this is just a bit (because there is a lot) of the output that I get from using this:
'½ú\x04Ü\x13\x8eÕ\nõ+;', '\x7fîÙ(\x11\x90\x85íÆ\x83Bs\x15Ü\xa0g\x03i\x00a\x070§¬gÃo\x18Ë\x9a\x81i[¡\x8eÃ{\x96FÃ9\x9f\x8aãð6°AÏ>ö·\x98+\x80e·!f\x8d\x0e{\x12W\x1eéÝ}iûͨ½niü>Ú¶mB¥»\tÜÀªÓÿº$í}b^3¢¡7\t\x1amwR\x19ò\x96\x83"Hf\x0fòÑ«NÀ=áXÝP½²£ç\x1a\x01ZÁÍEÃÌ4ÒÄ\x90-dÌìáy½Þ|yFÕ,4ýÂÍ.', "ð`\x9c\n\x99´-Á:bÒÒY²O\x86\x88\x06'\x93°Îx4û§'?Ì÷\xad\x00m{N¸r6a\x86×8Û\x9drâúÙÄ9\x85\x91\x0c-;",
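A likely cause of the gibberish above is that a .docx file is a ZIP archive of XML parts, so opening it in text mode decodes compressed bytes rather than the document text. The following is a minimal sketch, using only the standard library, of reading the paragraph text from the archive's word/document.xml part. The function name `read_docx_paragraphs` is hypothetical, and the sketch assumes a simple document: it handles only plain text runs and tab characters.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file (hypothetical helper)."""
    # A .docx is a ZIP archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        parts = []
        for node in para.iter():
            if node.tag == W_NS + "t":      # a run of text
                parts.append(node.text or "")
            elif node.tag == W_NS + "tab":  # a tab character typed in the text
                parts.append("\t")
        paragraphs.append("".join(parts))
    return paragraphs
```

Joining the result with "\n".join(read_docx_paragraphs('P.docx')) would give one paragraph per line, which suits a re.MULTILINE pattern. Note that indentation applied through Word's paragraph formatting is stored as a paragraph property rather than as characters, so it may not appear in the text as leading spaces or tabs.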
You might start the match with 1 or more spaces or tabs, and capture what you want in a group.
^[ \t]+([^\s.;]+(?:\s+[^\s.;]+)*[.;])
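^                   Start of the string (or of each line when re.MULTILINE is used, as in the example)
[ \t]+              Match 1+ tabs or spaces
(                   Capture group 1
  [^\s.;]+          Match 1+ non-whitespace chars except . or ;
  (?:\s+[^\s.;]+)*  Optionally repeat matching 1+ whitespace chars and 1+ non-whitespace chars except . or ;
  [.;]              Match either . or ;
)                   Close group 1

Example (this uses a slightly stricter variant: {2,} requires at least two leading spaces or tabs, and the final + requires at least two words)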
import re
from pprint import pprint
pattern = r"^[ \t]{2,}([^\s.;]+(?:\s+[^\s.;]+)+[.;])"
s = ("1. A method, comprising:\n\n"
     "    storing a first data related to an operation style of a transport in a first area;\n\n"
     "    storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and \n\n"
     "    modifying functionality of the transport based on the combined energy consumption efficiency.\n\n"
     "2. The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws. \n\n"
     "And here is the text that is actually read in from my function:\n\n"
     "> 1. A method, comprising: storing a first data related to an operation style of a transport in a first area; storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and modifying functionality of the transport based on the combined energy consumption efficiency.2. The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws.\n")
result = re.findall(pattern, s, re.MULTILINE)
pprint(result, width=100)
Output
['storing a first data related to an operation style of a transport in a first area;',
'storing a second data related to an operation style of the transport in a second area;',
'modifying functionality of the transport based on the combined energy consumption efficiency.']
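If the goal is the whole indented line rather than the text up to the first ; or . (as in the second expected line of the question), a variant could capture everything up to the end of the line. This is a minimal sketch, assuming each clause sits on its own line and is indented by at least two spaces or tabs; `indented_lines` is a hypothetical helper name, and the sample text is a shortened stand-in for the real document.

```python
import re

# Assumption: each wanted clause is on its own line, indented by 2+ spaces/tabs.
#   ^[ \t]{2,}  leading indent (not captured)
#   (\S.*?)     from the first non-whitespace char, as few chars as possible
#   [ \t]*$     skip trailing spaces/tabs up to the end of the line
INDENTED_LINE = re.compile(r"^[ \t]{2,}(\S.*?)[ \t]*$", re.MULTILINE)


def indented_lines(text):
    """Return the stripped content of every indented line in text."""
    return INDENTED_LINE.findall(text)


# Shortened sample text (hypothetical), shaped like the claim in the question.
sample = (
    "1. A method, comprising:\n\n"
    "    storing a first data in a first area;\n\n"
    "    storing a second data in a second area; wherein both are combined; and \n\n"
    "    modifying functionality of the transport.\n\n"
    "2. The method of claim 1.\n"
)

for line in indented_lines(sample):
    print(line)
```

Because . does not match a newline by default, each match stays within a single line, so unindented lines such as the numbered claim headers are skipped.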