python, regex, pdf, pdfminer, pdf-scraping

How can I use regex in my pdfminer code to extract text between two headings?


I have several PDFs that I want to extract data from. The code below already extracts all the text from a PDF, but now I want to extract only the text between two headings. I believe regex is the best way to do this, because the text between the two headings varies while the headings themselves are the same in every PDF.

This is an example PDF: https://www.scribd.com/document/396797318/123

I want to extract all the text between the headings "3. Induction Training" and "4. Corporate Training/Departmental Training".

The following code is what I am using to extract the data from the PDF:

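from io import BytesIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf_to_text(path):
    manager = PDFResourceManager()
    retstr = BytesIO()  # in-memory buffer that receives the extracted text
    layout = LAParams(all_texts=True)
    device = TextConverter(manager, retstr, laparams=layout)
    filepath = open(path, 'rb')
    interpreter = PDFPageInterpreter(manager, device)

    # Run every page through the interpreter; the converter writes into retstr.
    for page in PDFPage.get_pages(filepath, check_extractable=False):
        interpreter.process_page(page)

    text = retstr.getvalue()  # contents of the BytesIO buffer (bytes)

    filepath.close()
    device.close()
    retstr.close()
    return text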

if __name__ == "__main__":
    text = pdf_to_text("123.pdf")
    print(text)

What regex can I use to get the information I need?


Solution

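  • Try this regex with the DOTALL flag (re.DOTALL in Python), so that . also matches the line breaks inside the section: (?<=3\. Induction Training\n).*(?=4\. Corporate Training\/Departmental Training)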

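A minimal sketch of how the pattern above might be applied in Python. The helper name extract_section and the sample string are made up for illustration, and the input is assumed to be either str or UTF-8 encoded bytes (which is what getvalue() on a BytesIO buffer returns). Two small departures from the pattern as posted: the quantifier is lazy (.*?) so the match stops at the first occurrence of the closing heading, and the slash is left unescaped since Python does not require it.

```python
import re

# Lookbehind/lookahead keep the two headings out of the match itself.
# re.DOTALL lets "." match newlines, so the section can span several lines.
PATTERN = re.compile(
    r"(?<=3\. Induction Training\n).*?"
    r"(?=4\. Corporate Training/Departmental Training)",
    re.DOTALL,
)


def extract_section(text):
    """Return the text between the two headings, or None if not found."""
    # Assumption: bytes input is UTF-8 encoded; undecodable bytes are dropped.
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="ignore")
    match = PATTERN.search(text)
    return match.group(0).strip() if match else None


# Hypothetical stand-in for the output of pdf_to_text("123.pdf").
sample = (
    "2. Overview\nsome text\n"
    "3. Induction Training\n"
    "All new staff attend induction.\nIt lasts two days.\n"
    "4. Corporate Training/Departmental Training\nmore text\n"
)

if __name__ == "__main__":
    print(extract_section(sample))
```

The text that pdfminer actually produces may contain extra spaces or blank lines around the headings. If the pattern finds nothing, loosening the \n after the first heading (for example to \s*, moved outside the lookbehind, since Python lookbehinds must be fixed-width) is one thing to try.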