python, python-3.x, python-multithreading

Better way to parallelize my nth factorial python program?


I have this Python program to calculate the nth factorial of the number made of n "1"s in a row. I've been able to optimize it fairly well, including adapting it to run on all cores using the multiprocessing module. However, I've noticed that the 7th process (which handles the lowest values, since I split the range from the top down) is significantly faster than the rest. Processes 0-6 take 32 seconds on average with n=11, whereas process 7 takes only 12 seconds. I would have expected the difference to grow gradually as the numbers themselves got larger, not an immediate wall with such a stark contrast.

Is there something in my code, other than the calculation itself, that causes this huge wall? I've verified the output, and each segment is nearly identical in length (process 7's is slightly longer by a few dozen iterations, but in the grand scheme of things that's nothing, and process 7 is the shortest-running anyway).

Is there a better way to parallelize this for better efficiency? Would giving the processes unequal increments help?

Edit: Adding python version information

Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)] on win32

(I did 25 tests of n=11, all very similar to this run)

[Chart: Duration of each Process]

import multiprocessing
import argparse
from datetime import datetime
from math import log10

parser = argparse.ArgumentParser(
    formatter_class=argparse.HelpFormatter,
    description="Calcs n factorial",
    usage=""
)

parser.add_argument("-n", "--number", type=int, default=2)

args = parser.parse_args()

def getlog(send_end, i, threads, num, n, inc):
    # Worker: sum log10 over this process's slice of the descending range.
    begin = datetime.now()
    start = num - inc*i
    # The last process runs its slice all the way down to 0 to pick up the remainder.
    end = num - inc*(i+1) if i < threads-1 else 0
    output = sum(map(log10, range(start, end, -n)))
    send_end.send(output)
    final = datetime.now()
    duration = final - begin
    print("{},{},{},{}".format(i, duration, start, end))

def main():
    n = args.number
    num = int('1'*n)  # the repunit: n ones in a row
    # Only split across all cores when the workload is large enough to be worth it.
    cpus = multiprocessing.cpu_count()
    threads = cpus if num/cpus > cpus else 1
    inc = int(num/threads)
    inc -= inc % n  # align slice boundaries to multiples of n
    jobs = []
    pipe_list = []
    for i in range(threads):
        recv_end, send_end = multiprocessing.Pipe(False)
        p = multiprocessing.Process(target=getlog, args=(send_end, i, threads, num, n, inc))
        jobs.append(p)
        pipe_list.append(recv_end)
        p.start()
    for proc in jobs:
        proc.join()
    e = sum(output.recv() for output in pipe_list)

    # Reconstruct mantissa and exponent from the accumulated log10 sum.
    print('%.2fe%d' % (10**(e % 1), e // 1))
    
if __name__ == '__main__':
    start = datetime.now()
    main()
    end = datetime.now()
    print(end-start)

Solution

  • range uses a slower iterator implementation when it has to work with values beyond the range of a C long - see the CPython source.

    You're on Windows, where a C long is 32 bits even on a 64-bit Python build. Process 7 is the only one whose entire slice fits within a C long: with n=11 its values run from roughly 1.39e9 down to 0, all below 2**31 - 1 (about 2.15e9), while every other slice contains values above that bound.
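    A quick way to see the two implementations at work (a sketch, not part of the original answer; the 10**9 and 10**19 bases are arbitrary choices) is to time two equally long ranges: one whose values fit in a C long on any common platform, and one whose values exceed 2**63 - 1 and so take the slow path everywhere. On Windows the cutoff is already 2**31 - 1, which is why process 7's slice is the only fast one.

    ```python
    import timeit
    from math import log10

    N = 2 * 10**5  # elements per range; both ranges are the same length

    # All values fit in a C long on any common platform: range's fast iterator.
    small = timeit.timeit(
        lambda: sum(map(log10, range(10**9, 10**9 - N, -1))), number=5)

    # Values above 2**63 - 1 exceed a C long everywhere: slow, big-int iterator.
    big = timeit.timeit(
        lambda: sum(map(log10, range(10**19, 10**19 - N, -1))), number=5)

    print("fast-path range: {:.3f}s  slow-path range: {:.3f}s".format(small, big))
    ```

    On a Windows build you would see the same slowdown kick in far earlier, for any range element above 2**31 - 1.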