python, pdf, iteration, pypdf, word-count

How to iterate through PDF files and count the occurrences of the same list of specific words in each file?


I need help counting a list of specific words in many PDF files using Python. For example, I want to count the occurrences of the words "design" and "process" in two PDF files.

The following is my code:

output = []
count = 0
for fp in os.listdir(path):

    pdfFileObj = open(os.path.join(path, fp), 'rb')
    reader = PdfReader(pdfFileObj)
    number_of_pages = len(reader.pages)
    
    for i in range(number_of_pages):
        page = reader.pages[i]

        output.append(page.extract_text())
        text = str(output)
       
    
    words = ['design','process']
    count = {}
    for elem in words:
        count[elem] = 0
            
    # Count occurrences
    for i, el in enumerate(words):
        count[f'{words[i]}'] = text.count(el)
    
    print(count)

The code output is: {'design': 112, 'process': 31} {'design': 195, 'process': 56}

The first count is right: the first PDF file does contain 112 occurrences of "design" and 31 of "process". However, the second count is wrong. The second PDF contains 83 occurrences of "design" and 25 of "process", but the output values are much larger.

My expected output is: {'design': 112, 'process': 31} {'design': 83, 'process': 25}

I noticed that subtracting the first count from the second (195 - 112 = 83, 56 - 31 = 25) gives the correct values. I don't know how to fix the code. Could someone please help me? Thank you so much.


Solution

  • You neglected to reset the list output when you advance to the next file, so it still holds the text of every earlier file when the later files are counted. As you point out, the second set of numbers is the expected counts plus the counts from the first file.

    Set output = [] at the top of the body of the main for-loop, not above it.
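
    A minimal sketch of that fix, assuming the pypdf package and the same flat folder of PDF files as in the question. The function names count_words and count_words_in_pdfs are hypothetical, introduced only for illustration. The sketch also joins the pages with newlines instead of calling str(output), which is an extra change beyond the reset itself.

```python
import os


def count_words(text, words):
    """Return {word: number of non-overlapping occurrences of word in text}."""
    return {word: text.count(word) for word in words}


def count_words_in_pdfs(path, words):
    """Return {filename: {word: count}} for every PDF directly inside path."""
    # Imported here (an assumption of this sketch) so that count_words
    # can be used and tested without pypdf installed.
    from pypdf import PdfReader

    results = {}
    for fp in sorted(os.listdir(path)):
        if not fp.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF

        # The fix: start from an empty list for every file, so text from
        # earlier files is not counted again.
        output = []

        with open(os.path.join(path, fp), "rb") as pdf_file:
            reader = PdfReader(pdf_file)
            for page in reader.pages:
                output.append(page.extract_text())

        # Join the page texts rather than using str(output), which would
        # add list punctuation and escape sequences to the text.
        text = "\n".join(output)
        results[fp] = count_words(text, words)

    return results
```

    Calling count_words_in_pdfs(path, ['design', 'process']) should then report each file's counts independently. Note that str.count matches substrings and is case-sensitive, so "design" is also counted inside "designs" or "redesign", and "Design" is not counted at all.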