I scanned many pages of documents, made them machine-readable using OCR, and then read them using the Tika package in Python 3, which returns one long messy string I labeled "fulltext." I am trying to return all text that matches this pattern:
DESCRIPTION OF INCIDENT: (bla bla bla) \n\n\n\nStudent
For reference, this is what the paragraph that I want to capture looks like:
Description of Incident: \nStudent did bla bla bla. Student \nbla bla bla bla. Bla bla bla bla \nbla bla. \n\n\n\n! \nI \n\n.\'=fll \nBLABLA \n\nSCHOOL \n\n\'1 \n\nWas the student bla and/ or bla? \nYesO No~ \nlfyes, attach Report. \n\nWas parent/ guardian notified within 24 hours if the incident? Yes:1 / \nNo: r:J· \n\nIs a bla bla bla bla? YesO Noif \n\nCC: \n\nDistrict Qt\'" . / \nParent/Guardian\'EJ \n0therO \n\nh[t/ I lf \nDate \n\n(pis\, 1.-1 \nDate \n\n\n\nStudent Name:
It always starts with "Description of Incident" and ends in "\n\n\n\nStudent". So I don't want to capture the part that says "\n\n\n\n!" in the middle.
I tried this:
desc = re.findall("Description of Incident:+.\n\n\n\n", fulltext)
print(desc)
But I get back an empty list.
However, if I do:
desc = re.findall("Description of Incident:+.", fulltext)
Then I get a list that repeats ['Description of Incident: '] multiple times.
And if I do:
desc = re.findall("\n\n\n\n", fulltext)
Then I do get ['\n\n\n\n'] multiple times.
Finally, if I do:
desc = re.findall("Description of Incident:.+\n.+", fulltext)
Then I get part of the paragraph but only up to the second \n. Example: ['Description of Incident: \nStudent did bla bla bla. Student ']
Using escape characters doesn't help.
Try running the re.findall search in DOTALL mode, and change your pattern slightly:
desc = re.findall("Description of Incident:.*?\n\n\n\n(?=Student\\b)", fulltext, re.DOTALL)
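This appears to work, at least with your sample input data. By default, . does not match a newline, which is why your attempts stopped at the first \n; re.DOTALL removes that restriction. The pattern now matches and consumes everything from Description of Incident: onward, across newlines, until it reaches the first \n\n\n\n that is followed by (but does not include) the text Student. Because .*? is lazy and the lookahead (?=Student\b) must succeed, the \n\n\n\n! in the middle of the paragraph is skipped over rather than ending the match.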
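As a rough check of the behaviour described above, here is a minimal sketch. The fulltext string is a shortened, made-up stand-in for the Tika output (an assumption, not your real data), and first_try reproduces your original pattern only to show why it returns nothing.

```python
import re

# Shortened, made-up stand-in for the Tika output (not the real document).
fulltext = (
    "Header \n\nDescription of Incident: \nStudent did bla bla bla. Student \n"
    "bla bla. \n\n\n\n! \nI \n\nSCHOOL \n\nDate \n\n\n\nStudent Name: A \n"
    "Description of Incident: \nSecond report. \n\n\n\nStudent Name: B \n"
)

# Original attempt: ":+" means "one or more colons" and "." matches exactly
# one non-newline character, so the four newlines would have to come right
# after a single character following the colon.
first_try = re.findall("Description of Incident:+.\n\n\n\n", fulltext)

# Suggested pattern: lazy ".*?" with re.DOTALL spans newlines, and the
# lookahead requires "Student" right after the four newlines without
# including it in the match.
pattern = re.compile(r"Description of Incident:.*?\n\n\n\n(?=Student\b)", re.DOTALL)
desc = pattern.findall(fulltext)

for paragraph in desc:
    print(repr(paragraph))
```

Each match still ends with the four newlines; if you don't want them, something like [d.rstrip() for d in desc] should remove the trailing whitespace.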