
Finding on which page a search string is located in a PDF document using Python


Which Python packages can I use to find out on which page a specific “search string” is located?

I looked into several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality, and PDFMiner seems to be overkill for such a simple task. Any advice?

More precisely: I have several PDF documents and I would like to extract the pages that lie between a string “Begin” and a string “End”.


Solution

  • Finding on which page a search string is located in a pdf document using python

    PyPDF2

        # import packages (requires PyPDF2 >= 2.0; the older PdfFileReader,
        # getNumPages, getPage and extractText names were removed in 3.0)
        import PyPDF2
        import re

        # open the pdf file
        reader = PyPDF2.PdfReader(r"source_file_path")

        # get number of pages
        NumPages = len(reader.pages)

        # define keyterms
        String = "P4F-21B"

        # extract text and do the search
        for i in range(NumPages):
            PageObj = reader.pages[i]
            Text = PageObj.extract_text()
            # re.escape keeps the search literal even if String contains regex metacharacters
            ResSearch = re.search(re.escape(String), Text)
            if ResSearch is not None:
                print(ResSearch)
                print("Page Number" + str(i+1))
    

    Output:

    <re.Match object; span=(57, 64), match='P4F-21B'>
    Page Number1
    

    PyMuPDF

    import fitz
    import re
    
    # load document
    doc = fitz.open(r"C:\Users\shraddha.shetty\Desktop\OCR-pages-deleted.pdf")
    
    # define keyterms
    String = "P4F-21B"
    
    # get text, search for the string and print the count on each matching page
    for page in doc:
        text = page.get_text()
        # re.escape keeps the search literal; count once and reuse the result
        count = len(re.findall(re.escape(String), text))
        if count > 0:
            print(f'count on page {page.number + 1} is: {count}')
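
Neither snippet above covers the second part of the question, finding the pages between “Begin” and “End”. Here is a minimal sketch of one way to do it, assuming the text of each page has already been extracted into a list (for example with `[page.get_text() for page in doc]` in the PyMuPDF version). The helper names `pages_containing` and `page_range_between` and the sample texts are hypothetical and not part of either library. The match is a plain case-sensitive substring test.

```python
from typing import List, Optional, Tuple


def pages_containing(page_texts: List[str], needle: str) -> List[int]:
    """Return the 1-based numbers of the pages whose text contains needle."""
    return [n for n, text in enumerate(page_texts, start=1) if needle in text]


def page_range_between(
    page_texts: List[str], begin: str, end: str
) -> Optional[Tuple[int, int]]:
    """Return (first, last) as 1-based, inclusive page numbers.

    first is the first page containing begin; last is the first page at or
    after it containing end. Returns None if no such pair exists.
    """
    begin_pages = pages_containing(page_texts, begin)
    if not begin_pages:
        return None
    first = begin_pages[0]
    for last in pages_containing(page_texts, end):
        if last >= first:
            return first, last
    return None


# Hypothetical page texts standing in for real extracted text
sample_pages = ["cover", "intro Begin here", "body", "the End", "appendix"]
print(page_range_between(sample_pages, "Begin", "End"))  # (2, 4)
```

The returned range includes both marker pages. The sketch does not check the order of the two markers when they fall on the same page, so that case may need extra handling.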