Consider the following program:
import Queue
from multiprocessing import Queue as Q
from multiprocessing import Process, Value, Array
from time import time
from numpy import sum, zeros
from numpy.random import normal
def enche(queue, cnts, tims, t0, proc):
while True:
try:
queue.get(False)
x = sum(normal(size = 1000))
cnts[proc] = cnts[proc] + 1
except Queue.Empty:
tims[proc] = time() - t0.value
break
def main(m, nth):
queue = Q()
tmp = [queue.put(0) for i in xrange(m)]
cnts = Array('i', zeros(nth, dtype = "int"))
tims = Array('d', zeros(nth))
t0 = Value('d', time())
procs = [ Process(target = enche , args = (queue,cnts, tims, t0,i)) for i in xrange(nth) ]
for p in procs:
p.start()
for p in procs:
p.join()
for k in xrange(nth):
print "%2.0f/100, %4.1f ms,"%(cnts[k]/float(m) * 100, tims[k]*1000),
print ""
if __name__ == "__main__":
for k in xrange(100):
main(100, 4)
In essence, the function main(m,nth)
should do a task - sum a thousand normal random numbers m times - and divide this task in nth processes.
I then count and print what percentage of the task was done by each process and how long each process lasted. This the typical results I get:
4/100, 1.6 ms, 3/100, 2.2 ms, 88/100, 10.9 ms, 5/100, 2.7 ms,
23/100, 3.6 ms, 1/100, 1.5 ms, 3/100, 2.0 ms, 73/100, 14.9 ms,
8/100, 3.9 ms, 1/100, 3.4 ms, 8/100, 5.3 ms, 83/100, 21.0 ms,
6/100, 4.1 ms, 0/100, 3.2 ms, 76/100, 19.0 ms, 18/100, 8.8 ms,
5/100, 1.7 ms, 90/100, 9.2 ms, 5/100, 2.4 ms, 0/100, 2.6 ms,
2/100, 1.5 ms, 8/100, 2.4 ms, 89/100, 8.9 ms, 1/100, 2.6 ms,
8/100, 2.3 ms, 11/100, 2.7 ms, 3/100, 2.4 ms, 78/100, 8.5 ms,
52/100, 5.8 ms, 11/100, 3.2 ms, 4/100, 2.6 ms, 33/100, 5.9 ms,
6/100, 2.3 ms, 8/100, 2.9 ms, 24/100, 4.3 ms, 62/100, 7.7 ms,
13/100, 3.0 ms, 11/100, 2.6 ms, 1/100, 2.5 ms, 75/100, 8.5 ms,
6/100, 2.1 ms, 3/100, 2.1 ms, 6/100, 2.7 ms, 85/100, 9.0 ms,
8/100, 2.3 ms, 3/100, 2.0 ms, 27/100, 4.4 ms, 62/100, 7.6 ms,
18/100, 3.2 ms, 0/100, 1.9 ms, 6/100, 3.1 ms, 76/100, 9.2 ms,
3/100, 1.9 ms, 19/100, 3.4 ms, 30/100, 6.0 ms, 48/100, 7.9 ms,
10/100, 2.4 ms, 0/100, 1.5 ms, 8/100, 2.8 ms, 82/100, 8.8 ms,
6/100, 1.8 ms, 36/100, 4.7 ms, 32/100, 4.7 ms, 26/100, 4.7 ms,
8/100, 2.0 ms, 46/100, 5.9 ms, 46/100, 5.9 ms, 0/100, 2.6 ms,
3/100, 1.6 ms, 54/100, 6.2 ms, 1/100, 2.4 ms, 42/100, 6.1 ms,
12/100, 2.6 ms, 8/100, 2.4 ms, 34/100, 5.3 ms, 46/100, 6.3 ms,
1/100, 1.4 ms, 9/100, 2.3 ms, 1/100, 2.3 ms, 89/100, 9.2 ms,
13/100, 2.6 ms, 5/100, 2.2 ms, 3/100, 2.4 ms, 79/100, 8.7 ms,
11/100, 2.3 ms, 7/100, 2.6 ms, 8/100, 2.9 ms, 74/100, 8.8 ms,
8/100, 1.8 ms, 9/100, 2.7 ms, 0/100, 1.8 ms, 83/100, 9.3 ms,
13/100, 2.4 ms, 4/100, 2.0 ms, 3/100, 2.5 ms, 80/100, 9.9 ms,
10/100, 2.3 ms, 72/100, 8.0 ms, 6/100, 2.6 ms, 12/100, 4.1 ms,
27/100, 4.3 ms, 0/100, 1.5 ms, 5/100, 2.5 ms, 68/100, 8.4 ms,
3/100, 1.5 ms, 8/100, 2.3 ms, 86/100, 8.3 ms, 3/100, 2.7 ms,
7/100, 1.9 ms, 10/100, 2.5 ms, 5/100, 2.5 ms, 78/100, 8.2 ms,
14/100, 2.5 ms, 3/100, 2.0 ms, 5/100, 2.5 ms, 78/100, 8.2 ms,
53/100, 7.4 ms, 3/100, 1.9 ms, 1/100, 2.4 ms, 43/100, 7.5 ms,
3/100, 1.7 ms, 9/100, 2.5 ms, 47/100, 6.0 ms, 41/100, 6.0 ms,
3/100, 1.5 ms, 37/100, 4.7 ms, 33/100, 4.7 ms, 27/100, 4.7 ms,
7/100, 1.8 ms, 36/100, 4.5 ms, 32/100, 4.5 ms, 25/100, 4.5 ms,
3/100, 1.4 ms, 51/100, 5.6 ms, 46/100, 5.6 ms, 0/100, 2.3 ms,
5/100, 1.8 ms, 9/100, 2.5 ms, 46/100, 5.8 ms, 40/100, 5.9 ms,
9/100, 2.0 ms, 3/100, 1.8 ms, 5/100, 2.3 ms, 83/100, 8.8 ms,
7/100, 2.1 ms, 3/100, 2.1 ms, 8/100, 2.9 ms, 82/100, 8.6 ms,
11/100, 2.3 ms, 4/100, 2.0 ms, 3/100, 2.5 ms, 82/100, 9.3 ms,
19/100, 2.9 ms, 9/100, 2.6 ms, 10/100, 3.0 ms, 62/100, 7.3 ms,
18/100, 2.8 ms, 16/100, 2.9 ms, 2/100, 2.5 ms, 64/100, 7.9 ms,
2/100, 1.5 ms, 51/100, 5.8 ms, 47/100, 5.9 ms, 0/100, 2.4 ms,
26/100, 4.0 ms, 8/100, 2.3 ms, 11/100, 2.9 ms, 55/100, 7.6 ms,
8/100, 2.1 ms, 3/100, 1.9 ms, 79/100, 9.2 ms, 10/100, 3.8 ms,
14/100, 2.5 ms, 4/100, 2.1 ms, 45/100, 5.6 ms, 37/100, 5.6 ms,
11/100, 2.2 ms, 9/100, 2.8 ms, 4/100, 2.4 ms, 76/100, 8.9 ms,
3/100, 1.4 ms, 4/100, 2.1 ms, 5/100, 2.4 ms, 88/100, 9.5 ms,
3/100, 1.4 ms, 7/100, 2.4 ms, 49/100, 5.9 ms, 41/100, 5.9 ms,
3/100, 1.4 ms, 3/100, 1.8 ms, 92/100, 9.3 ms, 2/100, 2.6 ms,
9/100, 2.3 ms, 16/100, 3.1 ms, 41/100, 5.5 ms, 34/100, 5.5 ms,
5/100, 1.7 ms, 8/100, 2.3 ms, 47/100, 5.8 ms, 40/100, 5.7 ms,
6/100, 1.7 ms, 5/100, 2.2 ms, 5/100, 2.4 ms, 84/100, 8.6 ms,
7/100, 2.0 ms, 6/100, 2.2 ms, 47/100, 5.9 ms, 40/100, 5.8 ms,
5/100, 1.9 ms, 32/100, 5.1 ms, 6/100, 2.7 ms, 57/100, 7.7 ms,
11/100, 2.4 ms, 14/100, 2.7 ms, 42/100, 5.3 ms, 33/100, 5.3 ms,
50/100, 6.6 ms, 12/100, 2.8 ms, 0/100, 1.9 ms, 38/100, 6.6 ms,
55/100, 7.0 ms, 1/100, 1.7 ms, 2/100, 2.2 ms, 42/100, 6.9 ms,
8/100, 1.9 ms, 48/100, 6.7 ms, 5/100, 2.4 ms, 39/100, 6.7 ms,
12/100, 2.3 ms, 18/100, 3.0 ms, 30/100, 5.6 ms, 40/100, 6.8 ms,
5/100, 1.7 ms, 3/100, 1.9 ms, 5/100, 2.5 ms, 87/100, 8.8 ms,
50/100, 6.7 ms, 13/100, 2.7 ms, 0/100, 2.0 ms, 37/100, 6.7 ms,
Why do I get such a poor division of labor? Why sometimes a process last for short times and do nothing and others do almost 90% of the job?
What should I do to improve this?
edit: extra info - I run this in a i7 CPU.
The task it does in parallel is rather small compared to process start overhead, so that one process can do many iterations before other processes get a chance to start running.
If the line:
x = sum(normal(size = 1000))
is changed to:
x = sum(normal(size = 100000))
This is what I get:
25/100, 366.2 ms, 25/100, 364.4 ms, 25/100, 364.1 ms, 25/100, 361.7 ms,
34/100, 362.1 ms, 0/100, 2.6 ms, 25/100, 357.6 ms, 41/100, 357.8 ms,
25/100, 365.2 ms, 26/100, 362.7 ms, 25/100, 366.0 ms, 24/100, 355.7 ms,
21/100, 360.3 ms, 21/100, 355.9 ms, 22/100, 362.3 ms, 36/100, 358.0 ms,