
Why can't one processor be given direct access to the cache of another processor?


In a NUMA (Non-Uniform Memory Access) architecture, each processor has its own first-level cache, so there is a protocol (MESI) for processor communication. But why can't each processor be connected to the others' caches directly? I read that "the connection simply isn't fast enough", but that didn't explain much.

Thanks.


Solution

  • First, having an L1 cache doesn't imply a NUMA architecture; the motherboard topology is still the primary element that makes a machine UMA or NUMA.

    Second, the cache coherence protocol in use is architecture-dependent and may differ from MESI (in fact, MESIF is a better fit for NUMA machines).
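
    To make the idea of such a protocol concrete, here is a minimal sketch of the per-line MESI state transitions, assuming a simple snooping bus. The function names and the reduced event set are illustrative simplifications, not a full protocol (MESIF adds a Forward state on top of this):

        # Per-cache-line MESI states.
        MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

        def on_local_read(state, other_caches_have_line):
            if state == INVALID:  # read miss: fetch from memory or a peer
                return SHARED if other_caches_have_line else EXCLUSIVE
            return state          # M/E/S read hits keep their state

        def on_local_write(state):
            # S and I must broadcast an invalidate first; E upgrades silently.
            return MODIFIED

        def on_snooped_read(state):
            # Another CPU reads this line: a Modified copy is written back,
            # and every valid copy is downgraded to Shared.
            return SHARED if state != INVALID else INVALID

        def on_snooped_write(state):
            # Another CPU writes this line: our copy must be invalidated.
            return INVALID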


    Turning to your question

    Each processor can be connected to every other processor's cache. Indeed, every cache coherence protocol already does this, just not by allowing direct reads and writes, since that would take a lot of effort for poor reusability.

    However, it is possible to connect a CPU directly to another CPU's cache, and in a way this is implemented in Intel CPUs.
    Logical cores (i.e. Hyper-Threading cores) may share the L2 cache, and physical cores in the same package may share the L3 cache.
    There are two important aspects here: first, the number of CPUs sharing a cache is low, and second, they are in the same core/package. (On Linux you can observe this sharing directly; see the sketch below.)
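
    As an aside, on Linux the sharing can be inspected from user space through sysfs (the paths below are Linux-specific and the exact layout may vary by kernel and CPU):

        from pathlib import Path

        # Each indexN directory describes one cache of cpu0 (L1d, L1i, L2, L3...).
        cpu0 = Path("/sys/devices/system/cpu/cpu0/cache")
        for index in sorted(cpu0.glob("index*")):
            level = (index / "level").read_text().strip()
            ctype = (index / "type").read_text().strip()
            shared = (index / "shared_cpu_list").read_text().strip()
            print(f"L{level} {ctype}: shared by CPUs {shared}")

    On a Hyper-Threading machine this typically shows the L1/L2 caches shared by the two sibling logical cores and the L3 cache shared by all cores in the package.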

    Connecting all the caches directly would erase the boundary between what is inside the CPU (as a whole) and what is outside of it.
    Isolating the CPU lets us build very customizable and modular systems; an external protocol is an interface that lets us hide the implementation details, and this is worth more than the speed gained by closely connected caches.
    When we need that kind of speed, we build dedicated integrated system components, like a coprocessor.

    There are various reasons why caches are not directly connected. I cannot speak for industry leaders, but here are some general thoughts.

    1. It doesn't scale.
      2 processors means 1 link, 3 processors means 3 links, 4 processors means 6 links, and so on.
      In general, n processors need C(n, 2) links, that is, n * (n - 1) / 2 links (see the first sketch after this list).
      Also, you could connect only CPUs with compatible cache interfaces, which may imply that you could connect only identical CPUs. Cache architecture is something that changes frequently: lines may be made bigger, associativity may change, the timing of the signals may become faster.
      Lastly, if a CPU has enough pins to connect to only three other CPUs, you can build at most quad-CPU systems.
    2. It requires a lot of pins.
      Giving access to the caches requires a lot of pins: there are two or three caches per core, and each one needs to be addressed and controlled. This requires exposing a lot of pins; a serial interface is not an option, as it would be too slow.
      If you add that each processor must be connected to every other, the number of pins explodes quadratically.
      If you instead use a shared bus between caches, you are actually using a protocol like MESI, a protocol that tries to avoid congesting the bus, because with even a few CPUs the traffic on the shared bus is quite intense, and the time spent waiting for a turn to use it slows down the CPUs (even with store buffers and invalidation queues).
    3. It is slow.
      The cache is highly integrated with the core; it may support multiple read/write ports and other interfaces that increase parallelization. All this cannot be exposed outside the package/core without a large number of pins (and a huge increase in size and cost).
      The cache is physically close to the core, which minimizes the propagation delay. Consider that the period of a 3 GHz CPU is 1/3 * 10^-9 s; in that time, light can travel at most 10 cm, or 5 cm for a round trip, and a signal does not propagate at the speed of light (this arithmetic is worked out in a sketch after this list).
      Furthermore, when a cache is accessed only by one core, the designer can make some optimizations based on the internal architecture of that core. This is not possible if the core belongs to another, possibly different, CPU.
    4. It is complex.
      Letting a cache be accessed by multiple CPUs requires replicating a lot of circuitry. For example, since caches are associative, when an address is requested, a tag must be compared against a set of possible candidates, and this comparison circuitry must be replicated to allow other CPUs to read/write the cache asynchronously (see the last sketch after this list).
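
    To put numbers on point 1, a fully connected mesh of n caches needs C(n, 2) dedicated links, which grows quadratically:

        def links(n):
            # Number of point-to-point links in a fully connected mesh.
            return n * (n - 1) // 2

        for n in (2, 3, 4, 8, 16, 64):
            print(f"{n:3d} CPUs -> {links(n):4d} links")
        # 64 CPUs would already need 2016 links, before even counting
        # the pins that each individual link requires.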
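
    And the timing argument from point 3, made explicit (the figures assume an idealized signal travelling at the speed of light, which real traces do not reach):

        C = 3.0e8   # speed of light in m/s, an upper bound for any signal
        F = 3.0e9   # a 3 GHz clock

        period = 1 / F            # ~0.333 ns per cycle
        one_way = C * period      # ~0.10 m = 10 cm reachable in one cycle
        round_trip = one_way / 2  # ~5 cm budget for a request and a reply

        print(f"period = {period * 1e9:.3f} ns, "
              f"one way = {one_way * 100:.0f} cm, "
              f"round trip = {round_trip * 100:.0f} cm")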
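
    Finally, point 4 in miniature: a toy 4-way set-associative lookup (the sizes are made up for illustration). The per-way tag comparisons happen in parallel in hardware, so giving another CPU asynchronous access means duplicating exactly this comparison circuitry:

        NUM_SETS, WAYS, LINE_SIZE = 64, 4, 64

        # cache[set_index] holds WAYS entries of (valid, tag).
        cache = [[(False, 0)] * WAYS for _ in range(NUM_SETS)]

        def lookup(address):
            set_index = (address // LINE_SIZE) % NUM_SETS
            tag = address // (LINE_SIZE * NUM_SETS)
            # One comparator per way; hardware runs these in parallel.
            return any(valid and t == tag for valid, t in cache[set_index])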

    So, briefly: it would be possible to connect caches directly, it is just not worth it for discrete components. It is done for integrated components.