I'm learning more about the theoretical side of CPUs, and I read about how a cache can be used to fetch a line/block of memory from RAM into an area closer to the CPU that can be accessed more quickly (I think it takes fewer clock cycles because the CPU doesn't need to move the entire address of the next word into a register, and because the cache is physically closer to the CPU).
But now I'm not clear on the implementation exactly. The CPU is connected to RAM through a data bus that could be 32 or 64 bits wide in modern machines. But L3 cache can in some cases be as large as 32 MB, and I'm pretty convinced there aren't millions of data lines running from RAM to the CPU's cache. Even the tiny-in-comparison L1 cache of only a few KB would take hundreds or even thousands of clock cycles to fill from RAM through such a narrow data bus.
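To make my concern concrete, here's the rough back-of-envelope arithmetic I had in mind (the cache-line size, bus width, and L1 size below are just assumed typical values, not measurements of any specific CPU):

```python
# Back-of-envelope arithmetic with assumed, typical values.
CACHE_LINE_BYTES = 64        # assumed cache-line size
BUS_WIDTH_BYTES = 8          # a 64-bit data bus moves 8 bytes per transfer
L1_SIZE_BYTES = 32 * 1024    # assumed L1 data-cache size

transfers_per_line = CACHE_LINE_BYTES // BUS_WIDTH_BYTES    # 8 transfers per line
transfers_to_fill_l1 = L1_SIZE_BYTES // BUS_WIDTH_BYTES     # 4096 transfers for all of L1

print(f"One cache line   = {transfers_per_line} bus transfers")
print(f"Filling all of L1 = {transfers_to_fill_l1} bus transfers")
```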
So what I'm trying to understand is: how exactly is CPU cache implemented so it can transfer so much information while still being efficient? Are there any examples of relatively simple CPUs from the last few decades that I can look at to learn how they implemented that part of the architecture?
As it turns out, there actually is a very wide bus to move info between levels of cache. Thanks to Peter for pointing it out to me in the comments and providing useful links for further reading.
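For example (using assumed widths, not the numbers for any particular chip), if the path between cache levels moves 32 bytes per cycle instead of 8, a whole 64-byte line transfers in a couple of cycles rather than eight separate 64-bit transfers:

```python
# Illustrative comparison; the widths are assumptions, not specs of a real CPU.
CACHE_LINE_BYTES = 64

narrow_bus_bytes_per_cycle = 8    # e.g. a 64-bit bus, one transfer per cycle
wide_path_bytes_per_cycle = 32    # e.g. a 32-byte-wide path between cache levels

print(CACHE_LINE_BYTES // narrow_bus_bytes_per_cycle, "transfers over the narrow bus")  # 8
print(CACHE_LINE_BYTES // wide_path_bytes_per_cycle, "transfers over the wide path")    # 2
```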