finding on which page a search string is located in a pdf document using python

Which Python packages can I use to find out on which page a specific “search string” is located?

I looked into several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality, and PDFMiner seems to be overkill for such a simple task. Any advice?

More precisely: I have several PDF documents and I would like to extract the pages that lie between a string “Begin” and a string “End”.

Solution

PyPDF2

    # import packages
    import re
    import PyPDF2
    
    # open the pdf file
    # (PyPDF2 2.x/3.x names; older releases used PdfFileReader, getPage and extractText)
    reader = PyPDF2.PdfReader(r"source_file_path")
    
    # define the search term
    search_term = "P4F-21B"
    
    # extract the text of each page and search it
    # re.escape makes the term match literally instead of being read as a regex pattern
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        match = re.search(re.escape(search_term), text)
        if match is not None:
            print(match)
            print("Page Number" + str(i + 1))

Output:

<re.Match object; span=(57, 64), match='P4F-21B'>
Page Number1

PyMuPDF

import re
import fitz  # PyMuPDF

# load document
doc = fitz.open(r"C:\Users\shraddha.shetty\Desktop\OCR-pages-deleted.pdf")

# define the search term
search_term = "P4F-21B"

# get the text of each page, count the literal matches and print the count per page
for page in doc:
    text = page.get_text()
    count = len(re.findall(re.escape(search_term), text))
    if count > 0:
        print(f'count on page {page.number + 1} is: {count}')
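
Pages between “Begin” and “End”

Neither snippet above covers the second part of the question, selecting the pages between the two marker strings. Below is a minimal sketch of that step only. It assumes the page texts have already been extracted into a list, one string per page, for example with `page.get_text()` or `page.extract_text()` as above. The function name `pages_between` and its parameters are hypothetical, and the markers are matched as plain, case-sensitive substrings, so “End” would also match “Ending”.

```python
def pages_between(page_texts, begin="Begin", end="End"):
    """Return the 1-based page numbers from the first page containing
    `begin` through the first page containing `end` after it (inclusive).

    `page_texts` is assumed to be a list with one text string per page.
    Returns an empty list if `begin` is missing or no `end` follows it.
    """
    start = None
    for number, text in enumerate(page_texts, start=1):
        if start is None:
            pos = text.find(begin)
            if pos == -1:
                continue  # still looking for the begin marker
            start = number
            # on the start page, only text after the begin marker may hold the end marker
            text = text[pos + len(begin):]
        if end in text:
            return list(range(start, number + 1))
    return []


# made-up page texts, for illustration only
sample_pages = ["intro", "x Begin y", "middle", "z End", "after"]
selected = pages_between(sample_pages)
print(selected)
```

With PyMuPDF the list could probably be built as `[page.get_text() for page in doc]`. The returned numbers are 1-based, so subtract 1 before using them as page indexes when copying the pages into a new document.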