I have a number of records in the database I want to process. Basically, I want to run several regex substitution over tokens of the text string rows and at the end, and write them back to the database.
I wish to know whether does multiprocessing speeds up the time required to do such tasks. I did a
multiprocessing.cpu_count
and it returns 8. I have tried something like
process = []
for i in range(4):
if i == 3:
limit = resultsSize - (3 * division)
else:
limit = division
#limit and offset indicates the subset of records the function would fetch in the db
p = Process(target=sub_table.processR,args=(limit,offset,i,))
p.start()
process.append(p)
offset += division + 1
for po in process:
po.join()
but apparently, the time taken is higher than the time required to run a single thread. Why is this so? Can someone please enlighten is this a suitable case or what am i doing wrong here?
Here are a couple of questions:
In your processR
function, does it slurp a large number of records from the database at one time, or is it fetching 1 row at a time? (Each row fetch will be very costly, performance wise.)
It may not work for your specific application, but since you are processing "everything", using database will likely be slower than a flat file. Databases are optimised for logical queries, not seqential processing. In your case, can you export the whole table column to a CSV file, process it, and then re-import the results?
Hope this helps.