Tags: parallel-processing, computation-theory

What are the theoretical upper limits on parallelizability?


I am wondering about the theoretical upper limits on parallelization of computation.

Let's say we have a process that requires time T to complete with 1 core.

  • If the process is perfectly unparallelizable due to a serial bottleneck (commonly, reading from disk), it will take T seconds no matter how many cores we use.
  • If the process is perfectly parallelizable, we could complete it in T/K seconds using K cores.
  • If we have N equivalent processes to complete, where N >> K, then we should run multiple processes in parallel rather than parallelizing any given process. Perfectly parallelizable processes will take NT/K time either way, but jobs that contain a serial bottleneck could take up to NT time.

In theory, are there any "super-parallelizable" computational workloads that require less than T/K seconds each to complete with K cores? In other words, workloads that need less aggregate total time when more cores are available? If such workloads exist, then in the last case above it would actually be preferable to parallelize each of the N jobs individually.


Solution

  • Yes, and it's a big and important class of problems: those which are I/O bound.

    Imagine you have 10,000 instances of a process, and each instance takes around 9 seconds of I/O and 1 second of CPU/processing time. Now say you have a thousand processors. If you simply distribute the 10,000 instances across the 1,000 processors, the makespan (the total time to complete all work across all processors) will be about 100 seconds: 10 processes per processor at 10 seconds each. If you instead run all 10,000 processes in parallel, so their I/O waits overlap, the makespan will be closer to 10 seconds.
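The effect is easy to demonstrate at small scale. Below is a minimal sketch in Python, where `time.sleep` stands in for an I/O wait (like a real disk or network wait, it releases the GIL, so threads can overlap it). The task name and timings are illustrative, not taken from the question.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(_):
    """Hypothetical I/O-bound task: 50 ms spent waiting, ~0 ms of CPU."""
    time.sleep(0.05)

N = 20  # number of task instances

# Run sequentially: total time is roughly N * 0.05 seconds.
start = time.perf_counter()
for i in range(N):
    io_task(i)
sequential = time.perf_counter() - start

# Run all tasks in parallel threads: the waits overlap, so total
# time is roughly the duration of a single task, not the sum.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=N) as pool:
    list(pool.map(io_task, range(N)))
parallel = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
```

Note that this speedup does not even require extra cores: one core is enough, because the time saved comes from overlapping waits, not from doing more computation at once.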

    This works intra-process as well. Imagine a process that has 10 phases that are all I/O bound and completely independent. If each phase takes 10 seconds then the total time to run the process on one core is 100 seconds. If you parallelize this across 10 cores then the time will be closer to 10 seconds. This might be a good candidate for your "super-parallelizable" process which would benefit not only from distributing processes to different machines, but also parallelizing individual processes.
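The intra-process case above can be sketched the same way. Here `asyncio.sleep` stands in for each phase's I/O wait (the `phase` function and its timings are assumptions for illustration): ten independent phases run concurrently inside one process, so the process finishes in roughly the time of one phase rather than the sum of all ten.

```python
import asyncio
import time

async def phase(n):
    """Hypothetical independent I/O-bound phase: 50 ms of waiting."""
    await asyncio.sleep(0.05)
    return n

async def run_process():
    # Launch all 10 phases concurrently; gather waits for all of them.
    # Because the phases are independent and I/O bound, their waits
    # overlap instead of adding up.
    return await asyncio.gather(*(phase(i) for i in range(10)))

start = time.perf_counter()
results = asyncio.run(run_process())
elapsed = time.perf_counter() - start
print(f"10 phases finished in {elapsed:.2f}s")
```

With 50 ms phases, the sequential version would take about 0.5 s, while the concurrent version completes in little more than 0.05 s.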

    So, one scenario is a large number of jobs, each of which performs a large number of unrelated I/O-bound operations. If you run everything sequentially, you'll spend the vast majority of the time waiting on I/O, and that remains true even within a single process.