Suppose I have an int array. With 4-byte ints and a 64-byte cache line, a single line can hold 16 array elements, arr[0] through arr[15].

I would like to know what happens when you fetch, for example, arr[5] from the L1 cache into a register. How does this operation take place? Can the CPU pick an offset into a cache line and read the next n bytes?
The cache will usually provide the full line (64 B in this case), and a separate component in the load path would rotate and cut the result (usually a barrel shifter) according to the requested offset and size. You would usually also get some error checks along the way (if the cache supports ECC mechanisms).
Note that caches are often organized in banks, so a single read may have to gather bytes from multiple locations. By providing the full line, the cache can assemble the bytes in the proper order (and perform the checks) first, before the alignment logic picks out the relevant part.
Some designs focused on power saving may implement finer read granularity (fetching less than a full line from the data array), but this often just adds complexity, since you then have to handle more cases of an access being split across line segments.