I have this python code to calculate the nth factorial of n "1"s in a row. I've been able to optimize it very well, including adjusting it to run on all cores using the multiprocessing module. However I have noticed the 7th process (Which is the lower end of the values since I'm going from the top down) is significantly faster than the rest of the threads. Threads 0-6 take on average 32 seconds with n=11 where-as thread 7 only takes 12 seconds. I would've expecting a difference the larger the numbers themselves got, but I wouldn't expect an immediate wall with such a stark difference.
Is there something in my code that I have missed other than the calculation that causes this huge wall? I've verified the output and each segment is nearly identical length (thread 7 is slightly longer by a few dozen calculations, but in the grand scheme of things this is nothing and thread 7 is the shortest running anyways)
Is there a better way to parallelize this for better efficiency? Would making the threads not all the same increment be helpful?
Edit: Adding python version information
Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)] on win32
(I did 25 tests of n=11, all very similar to this run)
import multiprocessing
import argparse
from datetime import datetime
from math import log10
parser = argparse.ArgumentParser(
formatter_class=argparse.HelpFormatter,
description="Calcs n factorial",
usage=""
)
parser.add_argument("-n", "--number", type=int, default=2)
args = parser.parse_args()
def getlog(send_end, i, threads, num, n, inc):
begin = datetime.now()
start = num-inc*i
end = num-inc*(i+1) if i < threads-1 else 0
output = sum(map(log10, range(start, end, -n)))
send_end.send(output)
final = datetime.now()
duration = final-begin
print("{},{},{},{}".format(i, duration, start, end))
def main():
n = args.number
num = int('1'*n)
threads = multiprocessing.cpu_count() if num/multiprocessing.cpu_count() > multiprocessing.cpu_count() else 1
inc = int(num/threads)
inc -= inc%n
jobs = []
pipe_list = []
for i in range(threads):
recv_end, send_end = multiprocessing.Pipe(False)
p = multiprocessing.Process(target=getlog, args=(send_end, i, threads, num, n, inc))
jobs.append(p)
pipe_list.append(recv_end)
p.start()
for proc in jobs:
proc.join()
e = sum([output.recv() for output in pipe_list])
print('%.2fe%d' % (10**(e % 1), e // 1))
if __name__ == '__main__':
start = datetime.now()
main()
end = datetime.now()
print(end-start)
range
uses a slower implementation if it needs to work with values beyond the range of a C long
- see the source.
You're on Windows, where C long
is 32-bit (even on a 64-bit Python build). Process 7 is the only one where the range elements fit within the bounds of a C long
.