Using regex with text string from pdf read by Tika in python - trying to find line that ends in \n\n\n\n


I scanned many pages of documents, made them machine-readable using OCR, and then read them using the Tika package in Python 3, which returns one long messy string I labeled "fulltext." I am trying to return all text that matches this pattern:

DESCRIPTION OF INCIDENT: (bla bla bla) \n\n\n\nStudent

For reference, this is what the paragraph that I want to capture looks like:

Description of Incident: \nStudent did bla bla bla. Student \nbla bla bla bla. Bla bla bla bla \nbla bla. \n\n\n\n! \nI \n\n.\'=fll \nBLABLA \n\nSCHOOL \n\n\'1 \n\nWas the student bla and/ or bla? \nYesO No~ \nlfyes, attach Report. \n\nWas parent/ guardian notified within 24 hours if the incident? Yes:1 / \nNo: r:J· \n\nIs a bla bla bla bla? YesO Noif \n\nCC: \n\nDistrict Qt\'" . / \nParent/Guardian\'EJ \n0therO \n\nh[t/ I lf \nDate \n\n(pis\, 1.-1 \nDate \n\n\n\nStudent Name:

It always starts with "Description of Incident" and ends in "\n\n\n\nStudent". So I don't want to capture the part that says "\n\n\n\n!" in the middle.

I tried this:

    desc = re.findall("Description of Incident:+.\n\n\n\n", fulltext)
    print(desc)

But I get back an empty list.

However, if I do:

    desc = re.findall("Description of Incident:+.", fulltext)

Then I get a list that repeats ['Description of Incident: '] multiple times.

And if I do:

    desc = re.findall("\n\n\n\n", fulltext)

Then I do get ['\n\n\n\n'] multiple times.

Finally, if I do:

    desc = re.findall("Description of Incident:.+\n.+", fulltext)

Then I get part of the paragraph but only up to the second \n. Example: ['Description of Incident: \nStudent did bla bla bla. Student ']

Using escape characters doesn't help.


Solution

  • Try running re.findall in DOTALL mode (re.DOTALL), and also slightly change your pattern:

    desc = re.findall("Description of Incident:.*?\n\n\n\n(?=Student\\b)", fulltext, re.DOTALL)
    

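    This appears to work, at least with your sample input data. By default . does not match a newline, which is why your last attempt stopped at the second \n; re.DOTALL lifts that restriction. The pattern now says to match and consume everything from Description of Incident: across newlines, lazily, until reaching the first \n\n\n\n which is then followed by (but does not include) the text Student. The lookahead (?=Student\b) only checks for Student, so the earlier "\n\n\n\n!" in the middle is skipped.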
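To see the difference between the original attempt and the DOTALL pattern, here is a minimal sketch. The fulltext string below is a made-up, shortened stand-in for the Tika output (an assumption, not the real data), and the pattern is written as a raw string, which should behave the same as the escaped version above.

```python
import re

# Hypothetical, shortened stand-in for the Tika output (real newlines inside).
fulltext = (
    "Description of Incident: \nStudent did bla bla bla. Student \nbla bla. \n\n\n\n! \nI \n\n"
    "SCHOOL \n\nDate \n\n\n\nStudent Name: A\n"
    "Description of Incident: \nSecond report. \n\n\n\nStudent Name: B\n"
)

# Original attempt: ":+" repeats the colon and "." matches exactly one
# non-newline character, so four newlines would have to follow immediately.
first_try = re.findall("Description of Incident:+.\n\n\n\n", fulltext)

# Suggested pattern: lazy ".*?" under re.DOTALL crosses newlines and stops at
# the first "\n\n\n\n" that is directly followed by the word "Student".
pattern = re.compile(r"Description of Incident:.*?\n\n\n\n(?=Student\b)", re.DOTALL)
desc = pattern.findall(fulltext)

print(first_try)
print(len(desc))
```

Each match still ends with the four newlines, since they are consumed by the pattern; calling rstrip() on each item is one way to drop them if they are not wanted.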