
How to extract multiple instances of a word from PDF files in Python?


I'm writing a script in Python to read a PDF file and record both the string that appears after every instance of "time" and the page number it's mentioned on.

I have gotten it to recognize when a page contains the string "time" and print the page number. However, if the page contains "time" more than once, it does not tell me. I'm assuming this is because the check is already satisfied once "time" appears at least once, so the script moves on to the next page.

How would I go about finding multiple instances of the word "time"?

This is my code:

import PyPDF2

def pdf_read():
    pdfFile = r"records\document.pdf"
    
    pdf = PyPDF2.PdfFileReader(pdfFile)
    pageCount = pdf.getNumPages()
    
    for pageNumber in range(pageCount):
        page = pdf.getPage(pageNumber)
        pageContent = page.extractText()   
        if "Time" in pageContent or "time" in pageContent:
            print(pageNumber)

Also, as a side note, this PDF is a scanned document, so when I read the text in Python (or copy and paste it into Word), a lot of words come out with random symbols and characters even though the scan is perfectly legible. Is this a limitation of ordinary programming that can only be overcome with more complex techniques such as machine learning?


Solution

  • One solution is to split pageContent into a list of words and count how often 'time' appears in that list. This also makes it easy to get the word following 'time', since it is simply the next item in the list:

    import PyPDF2
    import string
    
    pdfFile = r"records\document.pdf"  # raw string keeps the backslash literal
    
    # PyPDF2 1.x/2.x API; PyPDF2 3.x renamed these to PdfReader, reader.pages and extract_text()
    pdf = PyPDF2.PdfFileReader(pdfFile)
    pageCount = pdf.getNumPages()
    
    for pageNumber in range(pageCount):
        page = pdf.getPage(pageNumber)
        pageContent = page.extractText()
        words = pageContent.split()  # split on any whitespace, including line breaks
        words = ["".join(c.lower() for c in w if c not in string.punctuation) for w in words]  # lowercase and strip punctuation
    
        print(words.count('time'))  # count occurrences of "time" on this page
        print([(w, words[i+1] if i+1 < len(words) else '') for i, w in enumerate(words) if w == 'time'])  # "time" and the word that follows it
    

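    Note that this example also lowercases every word and strips ASCII punctuation from it, so 'Time', 'time' and 'time:' are all counted as the same word. This should remove some of the stray symbols left by bad OCR, though probably not all of them.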
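
If you want the page number and the following word together in one result, the same idea can be sketched with a regular expression instead of string.punctuation. This is only a rough sketch: find_time_mentions is a hypothetical helper name, it assumes you have already extracted the text of each page into a list of strings, and page numbers are 0-based like in the loop above.

```python
import re

def find_time_mentions(page_texts):
    """Return a (page_number, following_word) pair for every 'time' found.

    page_texts is assumed to be a list with one extracted-text string per page.
    """
    mentions = []
    for page_number, text in enumerate(page_texts):
        # \w+ keeps runs of letters, digits and underscores and drops everything else
        words = re.findall(r"\w+", text)
        for i, word in enumerate(words):
            if word.lower() == "time":  # whole-word, case-insensitive match
                following = words[i + 1] if i + 1 < len(words) else ""
                mentions.append((page_number, following))
    return mentions


# Made-up sample text standing in for three extracted pages
sample_pages = [
    "Time: 10:30 and then time again",
    "no match here",
    "Overtime is not time",
]
result = find_time_mentions(sample_pages)
print(result)
```

With the code from the answer, page_texts could be built as [pdf.getPage(i).extractText() for i in range(pdf.getNumPages())]. Because the comparison is on whole words, "Overtime" is not counted, and a "time" at the very end of a page is reported with an empty following word.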