Tags: python, apache-spark, parallel-processing, pyspark, distribute

Distributing Python module - Spark vs Process Pools


I've made a Python module that extracts handwritten text from PDFs. The extraction can sometimes be quite slow (20-30 seconds per file). I have around 100,000 PDFs (some with lots of pages) and I want to run the text extraction on all of them. Essentially something like this:

fileNameList = ['file1.pdf', 'file2.pdf', ..., 'file100000.pdf']

for pdf in fileNameList:
    text = myModule.extractText(pdf)  # Distribute this function
    # Do stuff with text

We used Spark once before (a coworker set it up, not me) to distribute indexing a few million files from a SQL database into Solr across a few servers. From my research, though, Spark seems geared more toward parallelizing operations over large data sets than toward farming out a single slow task. For that, Python's built-in process pools (multiprocessing or concurrent.futures) look like a better fit, and I could just run the job on a single server with, say, 4 CPU cores.
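For context, this is roughly what I had in mind for the process-pool route (just a sketch, assuming myModule.extractText takes a file path and returns a string):

from concurrent.futures import ProcessPoolExecutor

import myModule

def process(pdf):
    # Extract the text in a worker process and hand it back to the parent
    return pdf, myModule.extractText(pdf)

if __name__ == '__main__':
    fileNameList = ['file1.pdf', 'file2.pdf']  # ... the full 100,000 files
    # One worker per core; bump max_workers if the box has more CPUs
    with ProcessPoolExecutor(max_workers=4) as pool:
        for pdf, text in pool.map(process, fileNameList):
            pass  # Do stuff with text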

I know SO is more for specific problems, but I just wanted some advice before I head down the entirely wrong road. For my use case, should I stick to a single server with process pools, or split the work across multiple servers with Spark?


Solution

  • This is a perfectly reasonable use case for Spark: by placing the files on distributed storage, you can spread the text-extraction work across multiple executors. That lets you scale out the compute and write the results back out efficiently with PySpark, and you can reuse your existing Python extraction code:

    # Read each PDF from distributed storage as a (filename, bytes) pair
    files = sc.binaryFiles("/path/to/files")
    # Run the existing extraction code on each file's contents
    processed = files.map(lambda pair: (pair[0], myModule.extract(pair[1])))
    

    As your data volume grows or you need more throughput, you can simply add nodes.
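
    If you need to persist the output, one option (a sketch, assuming an active SparkSession named `spark` and the `processed` RDD from above) is to turn the RDD into a DataFrame and write it to distributed storage:

    # Sketch: save the (filename, text) pairs as Parquet
    # Assumes `spark` is an active SparkSession and `processed` is the RDD above
    df = spark.createDataFrame(processed, ["filename", "text"])
    df.write.mode("overwrite").parquet("/path/to/output")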