python-3.x · performance · multiprocessing

Parameter sweep: Apparent exponential decay of improvement of processing time with added CPUs?


I recently ran a simple parameter sweep of a Python script I have written which uses multiprocessing. I sequentially assessed N files (adding 20 files per iteration) across 4, 8, 12, and 16 CPUs and measured how long the process took to run. I ran this on a node exclusive to my job (i.e. no other processes were competing for resources).

Results:

[Plot: total processing time vs. number of files, for 4, 8, 12, and 16 CPUs]

As you can see, the processing time improved with increasing CPUs. However, I did notice that the improvement appeared to follow an exponential-decay trend. A quick look at the processing time per plate made the exponential-decay trend even more obvious:

[Plot: processing time per plate vs. number of files, showing the exponential-decay trend]

The relevant part of the code is here:

import multiprocessing
from functools import partial

# Split the data by its unique identifier into one sub-DataFrame per group
list_of_sub_dfs = [group.reset_index(drop=True)
                   for _, group in df.groupby('Unique_identifier')]

all_original_curve_rows, all_subcurves, all_tm_rows = [], [], []

with multiprocessing.Pool(processes=int(args.processors)) as pool:
    partial_process_wrapper = partial(process_well,
                                      smoothing_factor=smoothing_fact,
                                      normalize=normalize_data)
    results = pool.map(partial_process_wrapper, list_of_sub_dfs)
    for original_curve_rows, sub_curve_rows, Tm_rows in results:
        all_original_curve_rows.extend(original_curve_rows)
        all_subcurves.extend(sub_curve_rows)
        all_tm_rows.extend(Tm_rows)

In short: the input data is broken into equal-sized chunks (split by a unique identifier in the data) and fed into the process_well function, which returns several lists.

Could anyone explain why this trend is occurring? Is there anything I can do to make the difference in performance more linear with the number of CPUs used?

Thanks in advance for any help that you can provide :)


Solution

  • Q2 :
    " Is there anything I can do to make the difference in performance more linear with the number of CPUs used? "

    Yes: design the code (the process) so that the SERIAL part of the end-to-end process-flow is minimal, ideally zero, and avoid adding any add-on overhead costs (process spawning, pickling of arguments and results, inter-process communication). Then, and only then, will your processing follow Amdahl's Law, with the atomicity-of-work left as the principal glass ceiling for performance boosting, provided no other resource blocks the process-flow. In real hardware other resources do block it: the number of memory-I/O channels is the first such performance blocker, maskable only to some extent by the NUMA cache hierarchy. The principle is clear: any scarce shared resource blocks further performance boosting.
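    One concrete add-on overhead cost in the pattern shown in the question is the per-task price of pickling each sub-DataFrame over to a worker and pickling its results back. A minimal sketch of batching tasks with `pool.map`'s `chunksize` parameter to amortise that cost follows; the worker body and all names here are assumptions modelled on the question's snippet, not the asker's real code:

```python
import multiprocessing
from functools import partial

# Hypothetical stand-in for the question's process_well worker
# (signature assumed from the question's snippet).
def process_well(sub_df, smoothing_factor=1.0, normalize=False):
    return ([], [], [])  # placeholder: (original_curve_rows, sub_curve_rows, Tm_rows)

def pick_chunksize(n_tasks, n_procs, waves_per_worker=4):
    # Batch tasks so each worker receives a few larger chunks,
    # amortising the pickling / IPC overhead over many tasks.
    return max(1, n_tasks // (n_procs * waves_per_worker))

def run(list_of_sub_dfs, n_procs, smoothing_fact, normalize_data):
    worker = partial(process_well,
                     smoothing_factor=smoothing_fact,
                     normalize=normalize_data)
    with multiprocessing.Pool(processes=n_procs) as pool:
        # chunksize > 1 ships tasks to workers in batches instead of one by one.
        return pool.map(worker, list_of_sub_dfs,
                        chunksize=pick_chunksize(len(list_of_sub_dfs), n_procs))
```

    Whether batching helps depends on how heavy each task is relative to the cost of shipping it; for very large sub-DataFrames the pickling cost itself, not the batching, may dominate.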

    Q1 :
    " Could anyone explain why this trend is occurring? "

    The speedup scaling you observe was first explained by Dr. Gene Amdahl in the early days of mainframe computing, in the last millennium. His formulation became known as the law of diminishing returns: no matter how many resources you add to the process-flow, even infinitely many, each newly added resource boosts the speedup less than the previous one, until adding more yields zero new speedup. Once that ceiling of diminishing returns is met, the processing time remains bound to at least the time of the SERIAL part of the flow; adding more resources to the PARALLEL part will not cut the still-SERIAL part shorter by a single femtosecond.

    That simple. The exact formula follows directly from splitting the work into its serial and parallel fractions, and interactive tools to simulate / animate the effects of the parameters are widely available.
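    The diminishing-returns shape can be reproduced numerically: Amdahl's Law predicts a speedup of S(N) = 1 / ((1 - p) + p / N) for a workload whose parallel fraction is p, run on N processors. A minimal sketch (the parallel fraction of 0.9 is an assumed value for illustration, not measured from the question's workload):

```python
def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's Law for parallel fraction p on n CPUs."""
    return 1.0 / ((1.0 - p) + p / n)

# Assumed parallel fraction of 90%, purely illustrative.
for n in (4, 8, 12, 16):
    print(n, round(amdahl_speedup(0.9, n), 2))
# -> 3.08, 4.71, 5.71, 6.4
# Each step of +4 CPUs buys less speedup than the previous one,
# matching the decaying improvement seen in the question's plots.
```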