Tags: multithreading, cpu, cpu-cores

Why isn't there any software that forces the use of multiple cores?


So this is a purely hypothetical question. I have to put up a disclaimer first: I have literally no clue how processors work on a low level, or even on a high level. However, low-level and high-level explanations are both appreciated, as I can still wrap my head around the answers (maybe taking me a few hours).

So the question is: how come there is software that just cannot take advantage of multiple cores or threads? Or, to word it better: how come multithreading support has to be coded into the software, and isn't something the processor automatically spreads across all its cores, regardless of the code?

My very naive way of looking at it is that the software will request some calculation from the CPU, so why can't the CPU have a "master thread" that does nothing but assign the calculations to each of the other threads and then forward the results back to the software as they come in?

I know that a lot of software can only use one core at a time, and from my naive understanding of how a CPU works, there shouldn't be anything stopping it from just sending the computations to all available cores.

On that note, the main question: Is it possible to create software (or a driver) that enables ANY software to use all available cores, regardless of how it has been coded?


Solution

  • On that note, the main question: Is it possible to create software (or a driver) that enables ANY software to use all available cores, regardless of how it has been coded?

    No, for the same reason two women cannot deliver a baby in four and a half months.

    Computation is the transformation of data, from input to output, with each step reading the data it needs and producing its result.
    This clearly means there are dependencies between the steps: (x + 1)^2 for x = 3 is 16, but to get this result we first perform the step y = x + 1 and then the step y^2.
    We cannot compute y^2 before, or even concurrently with, x + 1 and still get the correct result.

    In short, not everything is parallelizable.
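
    As a minimal sketch in C++ (variable names chosen to mirror the example above), the dependency looks like this:

    ```cpp
    #include <iostream>

    int main() {
        int x = 3;
        // Step 2 depends on step 1: y must exist before y * y can be computed.
        int y = x + 1;       // step 1: y = 4
        int result = y * y;  // step 2: 4^2 = 16
        std::cout << result << '\n'; // prints 16
        // No scheduler, in hardware or software, can run these two steps
        // at the same time without changing the meaning of the program.
        return 0;
    }
    ```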

    The CPU, as Harold pointed out, can exploit the intrinsic parallelism of some computations: (x + 1) + (x + 2) can be split into computing y = (x + 1) and z = (x + 2) in parallel and then computing y + z.
    It's all about the dependency chains of the computation.
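
    To make that independence visible at the source level, here is a small illustrative C++ sketch (the CPU does this transparently at the instruction level; the std::async calls merely make the lack of dependency explicit):

    ```cpp
    #include <future>
    #include <iostream>

    int main() {
        int x = 3;
        // y and z do not depend on each other, so they may run concurrently.
        auto y = std::async(std::launch::async, [x] { return x + 1; });
        auto z = std::async(std::launch::async, [x] { return x + 2; });
        // The final addition depends on both, so it must wait for them.
        std::cout << y.get() + z.get() << '\n'; // 4 + 5 = 9
        return 0;
    }
    ```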

    The hard thing about this optimisation is that, unlike in these examples, instructions often have side effects, and one must be very careful to take them into account.
    Most effort nowadays goes into quickly predicting when a normally forbidden optimisation is allowed, a prediction that is accurate most but not all of the time, and into recovering quickly from a misprediction. Furthermore, there is a limit on the resources available for finding and tracking these optimisations.

    All this logic is packed into a core: it fetches, decodes, issues, dispatches, executes, and retires instructions in a way that exploits this intrinsic parallelism.

    Even with this help, a core usually has more functional units than a single program can use; this may be because the program uses only integers, for example. Also, since modern CPUs are very complex, exploiting them fully is complex too.
    That's why SMT (e.g. two hardware threads per core) was introduced: each thread has its own program (context) but shares every other resource in the core, so while one program is using the integer units, another using the floating-point units can keep the core fully utilised.
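
    A hypothetical illustration of the workload mix SMT benefits from: one thread does integer-only work while the other does floating-point-only work. (Whether the two actually share a physical core is up to the OS scheduler, so treat this as a sketch of the idea, not a benchmark.)

    ```cpp
    #include <iostream>
    #include <thread>

    int main() {
        long   int_acc = 0;
        double fp_acc  = 0.0;

        // Integer-heavy loop: keeps the integer ALUs busy.
        std::thread integer_thread([&] {
            for (long i = 0; i < 100000000L; ++i) int_acc = int_acc + i;
        });
        // Floating-point-heavy loop: keeps the FP units busy.
        std::thread float_thread([&] {
            for (long i = 0; i < 100000000L; ++i) fp_acc = fp_acc + 0.5;
        });

        integer_thread.join();
        float_thread.join();
        std::cout << int_acc << ' ' << fp_acc << '\n';
        return 0;
    }
    ```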

    However, each thread has its own context; it is as if each thread had its own values for x, y, and z.
    If we compute y = (x + 1) on Core 0, we cannot send y^2 to Core 1, because the y used would be the one in Core 1 and thus the wrong one.
    Therefore, parallelising a program requires human intervention to split the single program into two or more. Sending y^2 to Core 1 would also require sending y, and that would be too slow; more on why below.
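
    This is what that human intervention typically looks like, sketched in C++: the programmer, not the CPU, splits one computation into two independent halves, runs each on its own thread (and thus potentially on its own core), and combines the partial results at the end.

    ```cpp
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> data(1000000, 1);
        long sum_lo = 0, sum_hi = 0;
        auto mid = data.begin() + data.size() / 2;

        // Each thread works on its own independent slice of the data.
        std::thread lo([&] { sum_lo = std::accumulate(data.begin(), mid, 0L); });
        std::thread hi([&] { sum_hi = std::accumulate(mid, data.end(), 0L); });
        lo.join();
        hi.join();

        // The combining step depends on both halves, so it runs last.
        std::cout << sum_lo + sum_hi << '\n'; // 1000000
        return 0;
    }
    ```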

    When the cost of adding another core became lower than the cost of further optimising the core microarchitecture, the manufacturers started including multiple cores.

    Why can't the mechanism used to exploit the intrinsic parallelism be extended to dispatch instructions to multiple cores/threads?
    Because it's electronically impossible.
    For it to work, there would have to be a shared context (a set of variables x, y, ...), and having a single context accessed by many cores would make it slow.
    It may not be intuitive, but choosing between 16 destinations is faster than choosing between 32. The same is true when managing 4 readers instead of 16.
    Furthermore, at the speeds of modern CPUs, the geometry of the traces matters a lot.

    So cores are designed to be fast, with fast internal buses and fast, tightly coupled components working at more or less the same frequency.
    The CPU uncore is designed to be as fast as possible, with fast decoupling between the cores and the other components, which work at different frequencies.

    In short, dispatching instructions to other cores would be slow; communication between cores is orders of magnitude slower than communication within a core.
    For general-purpose CPUs it is not worthwhile to send the data along with the program.
    It is more performant to have the programmer program each core/thread individually and exchange the data where it is needed, as in the sketch below.
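
    As a sketch of that explicit exchange, here thread A computes y and hands it to thread B through a std::future; the hand-off is exactly the "sending y along with the program" described above, and in hardware terms it is far more expensive than keeping y inside a single core:

    ```cpp
    #include <future>
    #include <iostream>
    #include <thread>

    int main() {
        std::promise<int> p;
        std::future<int> f = p.get_future();

        // Producer: computes y = x + 1 and explicitly sends it on.
        std::thread producer([&p] { int x = 3; p.set_value(x + 1); });
        // Consumer: blocks until y arrives, then computes y^2.
        std::thread consumer([&f] {
            int y = f.get();
            std::cout << y * y << '\n'; // 16
        });

        producer.join();
        consumer.join();
        return 0;
    }
    ```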

    Special-purpose ASICs may take a different approach: GPUs, for example, exploit a different kind of parallelism than CPUs do.