Tags: python, postgresql, multicore

Beginner question about Python multiprocessing?


I have a number of records in the database that I want to process. Basically, I want to run several regex substitutions over the tokens of each text row and, at the end, write the results back to the database.
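
The per-row work looks roughly like this (the patterns below are placeholders, not my real ones):

import re

# Placeholder substitutions standing in for the real ones;
# each row's text runs through all of them in order.
PATTERNS = [
    (re.compile(r"\s+"), " "),     # collapse runs of whitespace
    (re.compile(r"<[^>]+>"), ""),  # strip markup-like tokens
]

def process_text(text):
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text.strip()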

I wish to know whether multiprocessing speeds up such tasks. I ran

multiprocessing.cpu_count()

and it returned 8. I have tried something like this:

from multiprocessing import Process

processes = []
offset = 0
division = resultsSize // 4  # resultsSize is the total number of rows

for i in range(4):
    if i == 3:
        # the last worker takes whatever rows remain
        limit = resultsSize - (3 * division)
    else:
        limit = division

    # limit and offset indicate the subset of records the
    # function fetches from the db
    p = Process(target=sub_table.processR, args=(limit, offset, i))
    p.start()
    processes.append(p)
    offset += division  # next slice starts right after this one

for p in processes:
    p.join()

but apparently, the time taken is higher than the time required to run it single-threaded. Why is this so? Can someone please enlighten me: is this a suitable case for multiprocessing, or what am I doing wrong here?


Solution

  • Here are a couple of questions:

    1. In your processR function, does it slurp a large number of records from the database at once, or does it fetch one row at a time? (Each individual row fetch is very costly, performance-wise.) See the first sketch after this list for batched fetching.

    2. It may not work for your specific application, but since you are processing "everything", using a database will likely be slower than a flat file. Databases are optimised for logical queries, not sequential processing. In your case, can you export the whole table column to a CSV file, process it, and then re-import the results? The second sketch after this list shows that round trip.
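
    For point 1, here is a minimal sketch of batched fetching with psycopg2's fetchmany(); the DSN, table, and column names (my_table, text_col) are placeholders for your schema:

    import psycopg2

    BATCH_SIZE = 10000  # rows per round trip; tune for your data

    def process_rows(dsn, limit, offset):
        # Fetch one slice of the table in large batches rather
        # than one row at a time.
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        cur.execute(
            "SELECT id, text_col FROM my_table ORDER BY id"
            " LIMIT %s OFFSET %s",
            (limit, offset),
        )
        while True:
            rows = cur.fetchmany(BATCH_SIZE)  # one round trip per batch
            if not rows:
                break
            for row_id, text in rows:
                pass  # apply your regex substitutions to text here
        conn.close()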
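
    And for point 2, a sketch of the export/re-import round trip using PostgreSQL's COPY through psycopg2's copy_expert(); my_table, text_col, and the staging table my_table_processed are again placeholders:

    import psycopg2

    def export_to_csv(dsn, path):
        # Dump the rows to a flat CSV file for sequential processing.
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        with open(path, "w") as f:
            cur.copy_expert(
                "COPY (SELECT id, text_col FROM my_table) TO STDOUT WITH CSV",
                f,
            )
        conn.close()

    def import_from_csv(dsn, path):
        # Load the processed file back into a staging table.
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        with open(path) as f:
            cur.copy_expert("COPY my_table_processed FROM STDIN WITH CSV", f)
        conn.commit()
        conn.close()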

    Hope this helps.