python, multithreading, python-multiprocessing, concurrent.futures

Multiprocessing for appending text extracted with a loop to a list in Python


As a Python (and programming) novice, I am trying to extract the text of thousands of PDFs into a file (or list, if better). The data is going to be used for content analysis at a later stage. I created a working function that iterates through all PDFs in a directory, extracts text using pdfplumber and appends it to a list.

I would now like to use multiprocessing to speed up what would otherwise be a horrendously lengthy process. However, I can't figure out how best to implement a loop that appends to a list using parallel processes. Below is an attempt, adapted from a few tutorials, to use concurrent.futures with my function:

import pdfplumber
import os
import concurrent.futures

def pdfextractor(file):
    text = []
    for file in os.listdir("./processing/"):
        filename = os.fsdecode(file)
        if filename.endswith('.pdf'):
            with pdfplumber.open("./processing/" + file) as pdf:
                pdf_page = pdf.pages[0]
                single_page_text = pdf_page.extract_text()
                text.append(str(single_page_text))

if __name__ == "__main__":
    executor = concurrent.futures.ProcessPoolExecutor(4)
    futures = [executor.submit(pdfextractor, 'file')]
    concurrent.futures.wait(futures)

This results in several processes being started that each append the same PDF text to the text list. Changing ProcessPoolExecutor to ThreadPoolExecutor yields the desired output of the function, but no increase in speed.


Solution

  • After a lot of tinkering, I managed to figure this out. Here is the code, which cut the time needed to run the function by more than half. I split my function in two to better fit concurrent.futures: extract_pdf goes through all pages of a single PDF, extracts the text and joins it into one string. extract_all then runs that extraction in parallel over an entire directory. The result is a list with one string per PDF.

    import os
    import concurrent.futures
    
    import pdfplumber
    
    
    def extract_pdf(filename, directory):
        filename = os.fsdecode(filename)
        if not filename.endswith('.pdf'):
            return None  # skip non-PDF files (e.g. .DS_Store)
        allpages_text = []
        with pdfplumber.open(os.path.join(directory, filename)) as pdf:
            for page in pdf.pages:
                # extract_text() can return None on empty pages
                allpages_text.append(page.extract_text() or '')
        return '-'.join(allpages_text)
    
    
    def extract_all(directory):
        data = []
        with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
            # Submit one task per file so the pool can actually divide the work
            futures = {executor.submit(extract_pdf, filename, directory): filename
                       for filename in os.listdir(directory)}
            for future in concurrent.futures.as_completed(futures):
                try:
                    result = future.result()
                    if result is not None:  # drop entries for non-PDF files
                        data.append(result)
                except Exception as exc:
                    print("There was an error with {}: {}".format(futures[future], exc))
        return data
    
    
    if __name__ == '__main__':
        directory = './test/'
    
        results = extract_all(directory)
    
    

    This website was very helpful for me to better understand what was going on with the concurrent.futures module: https://rednafi.github.io/digressions/python/2020/04/21/python-concurrent-futures.html