Tags: caching, assembly, mips, cpu-architecture, memory-address

How does a direct-mapped cache return data?


I am taking the first class in computer architecture and assembly with the Computer Organization & Design textbook by Patterson & Hennessy. I am currently learning about caches. I understand that the CPU will check the cache for: 1. the index of the cache block, 2. the valid bit in that block, 3. a matching tag. From there, if it is valid, the data is sent to the processor.

I am having trouble picturing this as a real thing, though. For example, maybe we want to load a word from $x to $y. The processor gets an address, say 0x12345670, to represent $x. So then '0' would be the offset, '67' would be the index, and the rest would be the tag. We got a hit from the cache, but where is the data? Is the address in question the data? Is there more space in the cache that holds the data we need? Does it send you directly to that location in memory? How do we get the data from $x through the cache, assuming it is there?

Also, assuming a 32-bit word length, do you send the address to the cache and get a full word back, or just the bits that are needed?

PS: It is most helpful for me to learn from real examples, so if you have resources on practice for this (and MIPS programming, particularly non-leaf functions), I would really appreciate that as well.


Solution

  • A cache is not a programming thing; if you want some real examples, look at opencores.org. There are at least a few cores there that have caches in them (you have to read the VHDL/Verilog). It is not something you see in a MIPS program, necessarily.

    On that note, understand this textbook is just that, a textbook; neither MIPS nor others have made a processor that follows that exact model. Designs exactly like that either never happened or stopped happening a while ago. Many folks learned from that book, so we often speak using those terms, but real pipelines are deeper and different, and caches are mostly the same but can vary in size and width, and definitely in how they determine who loses and gets evicted.

    A cache is often described as having some number of bytes, such as a 64 Kbyte cache. Those are the bytes that hold the data. The data read from the slow side, the DRAM/main memory, is cached, or stored, in the cache. That is where it lives.

    When you do a read of a byte at 0x12345670, and let's say the cache line is 256 bytes, then on a miss the read on the slow/DRAM side starts at the equivalent of 0x12345600. On a miss, the cache logic determines where in the cache it will store the cache line containing your byte. If someone else is squatting there, their data needs to be evicted. If that data is newer than what is in DRAM (some write happened to it), then that data gets written back to DRAM before the cache reads your line in; then, ultimately, your byte is sent to you (more likely the bus width, 32 or 64 bits, is sent to you and the processor isolates the byte lane from that). If the cache line was empty, or if the data there did not need to be written back to DRAM, then the cache simply reads the line containing your byte from the DRAM/slow memory side and delivers your byte.

    A write is very similar, but as implied above, if there is a hit then the read-modify-write happens in the cache and you're done. If there is a miss, then something may have to be evicted, then a read from slow memory happens, then the read-modify-write happens. Ideally a bit somewhere is set by the cache so it knows this cache line is newer than the copy in DRAM; when evicted it needs to be written back to DRAM, it can't be discarded. Both paths are sketched in code below.
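    To make that concrete, here is a minimal C sketch of a direct-mapped write-back cache covering both paths. The geometry (16 lines of 256 bytes), the memory[] array standing in for the slow DRAM side, and every name here are invented purely for illustration; a real cache is hardware, not function calls.

    ```c
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 256u
    #define NUM_LINES  16u

    struct line {
        int      valid, dirty;
        uint32_t tag;
        uint8_t  data[LINE_BYTES];   /* this is where the data lives */
    };

    static struct line cache[NUM_LINES];
    static uint8_t memory[1u << 20];  /* fake 1 MB DRAM on the slow side */

    static uint8_t cache_read(uint32_t addr) {
        uint32_t offset = addr % LINE_BYTES;
        uint32_t index  = (addr / LINE_BYTES) % NUM_LINES;
        uint32_t tag    = addr / (LINE_BYTES * NUM_LINES);
        struct line *l  = &cache[index];

        if (!l->valid || l->tag != tag) {            /* miss */
            if (l->valid && l->dirty) {              /* evict the squatter */
                uint32_t old = (l->tag * NUM_LINES + index) * LINE_BYTES;
                memcpy(&memory[old], l->data, LINE_BYTES);
            }
            memcpy(l->data, &memory[addr - offset], LINE_BYTES); /* line fill */
            l->valid = 1;
            l->dirty = 0;
            l->tag   = tag;
        }
        return l->data[offset];                      /* hit: served from cache */
    }

    static void cache_write(uint32_t addr, uint8_t value) {
        (void)cache_read(addr);       /* allocate on miss: read-modify-write */
        struct line *l = &cache[(addr / LINE_BYTES) % NUM_LINES];
        l->data[addr % LINE_BYTES] = value;
        l->dirty = 1;                 /* newer than DRAM; written back on evict */
    }

    int main(void) {
        cache_write(0x45670u, 0xABu);  /* miss, fill, modify, mark dirty */
        return cache_read(0x45670u) == 0xABu ? 0 : 1;  /* hit, from the cache */
    }
    ```

    Note that the write path allocates the line first (the read-modify-write) and then only sets the dirty bit; DRAM is not touched again until the line is evicted.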

    The cache also acts as a convenient store-and-forward buffer for the slower DRAM. You want to limit unnecessary DRAM accesses: if you want to write a byte, you don't want to read a DRAM memory/bus width, modify the byte, and write it back at DRAM speeds; you want to do that in the SRAM of the cache. The cache allows the slower side to have all of its transactions be one optimal size. There can be multiple caches between the processor core and the system memory, so not all of the caches will do this or have to, but ideally the last one does.

    If I had a cache with 256-byte cache lines, then the lower 8 bits of the address are the offset into the cache line, and you wouldn't use those for the rest of the addressing. If I had a 128 Kbyte cache with 256-byte cache lines, that means I can hold 512 lines, and I need 9 bits to pick from 512. Those bits come from the remaining address bits (0x123456). Whatever address bits are left over after that have to be stored along with my cache line, so that I know exactly which 256 bytes of the memory space this is. So in a super simple design, the lower bit of the '4' nibble and the 0x56 are used to find which of the 512 lines, and the remaining 0x1234 minus that lsbit, 0b000100100011010, has to be stored in the cache as part of the lookup (the actual tag). So I need 512 * 15 bits of RAM for the tags themselves, plus some more bits to mark valid/invalid and dirty/clean.
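    Here is that exact bit split as a tiny C program, assuming the 128 Kbyte / 256-byte-line geometry just described (the variable names are mine, nothing standard):

    ```c
    #include <stdio.h>

    int main(void) {
        unsigned addr   = 0x12345670u;
        unsigned offset = addr & 0xFFu;         /* low 8 bits:  0x70       */
        unsigned index  = (addr >> 8) & 0x1FFu; /* next 9 bits: 0x056      */
        unsigned tag    = addr >> 17;           /* top 15 bits: 0x091a,
                                                   i.e. 0b000100100011010  */
        printf("offset=0x%02x index=0x%03x tag=0x%04x\n", offset, index, tag);
        return 0;
    }
    ```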

    There is a delicate balance between cache line size and overhead. Caches are a gamble as far as performance goes; you can always defeat them and find a benchmark that makes the cache worse than not having one. The smaller the cache line, the less waste you have per transaction: if you read single bytes at random locations in memory, that means the cache is reading 256 bytes for every one of yours, not very efficient, so 64 bits per line against your 8 is much less painful. But if your programs are more linear, string copies, programs that don't branch much, then a larger cache line might be useful. And the larger the cache line, the fewer lines you have, so the less total tag storage and less overhead in general.
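    A quick back-of-the-envelope in C, assuming a 32-bit address and the 128 Kbyte cache above, shows the tag overhead side of the gamble; for a fixed cache size the tag stays 15 bits per line, but shrinking the line quadruples how many tags you must store:

    ```c
    #include <stdio.h>

    static unsigned log2u(unsigned x) {   /* x is a power of two here */
        unsigned n = 0;
        while (x >>= 1) n++;
        return n;
    }

    int main(void) {
        const unsigned cache_bytes  = 128 * 1024;  /* the 128 Kbyte example */
        const unsigned line_sizes[] = { 64, 256 };
        for (int i = 0; i < 2; i++) {
            unsigned lines       = cache_bytes / line_sizes[i];
            unsigned offset_bits = log2u(line_sizes[i]);
            unsigned index_bits  = log2u(lines);
            unsigned tag_bits    = 32 - offset_bits - index_bits;
            printf("%3u-byte lines: %4u tags x %u bits = %5u bits of tag RAM\n",
                   line_sizes[i], lines, tag_bits, lines * tag_bits);
        }
        return 0;  /* prints 30720 bits for 64-byte lines, 7680 for 256 */
    }
    ```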

    You can have multiple ways as well, instead of the above where my cache is 512 direct-mapped 256-byte lines. I might choose to take 7 of the address bits rather than 9 and have 4 ways. If I had a program whose reads kept landing on the same index (a stride equal to the cache size, for example), then one cache line would get pounded on and the other 511 wouldn't get used much. But instead I take fewer bits (7) from the address and have some number of ways (the associativity). That one address now has one of four possible places to land, and the logic looks for hits in all four locations; if they all miss, an unused way is filled if there is one, and if not, an algorithm determines who gets evicted (last one in, or a randomizer, or round robin, etc.). Now that access pattern uses at least 4 lines, not 1. Hopefully the gamble pays off and the program comes back to use some of those lines. A sketch of the 4-way lookup follows.
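    Here is that 4-way lookup as a C sketch, again with invented names, keeping the 512 lines above but grouping them as 128 sets of 4 ways with round-robin eviction:

    ```c
    #include <stdint.h>

    #define WAYS       4u
    #define SETS       128u     /* 512 lines grouped as 128 sets of 4 ways */
    #define LINE_BYTES 256u

    struct way { int valid; uint32_t tag; /* line data omitted for brevity */ };

    static struct way set[SETS][WAYS];
    static unsigned   victim[SETS];       /* round-robin eviction pointer */

    /* Returns which way now holds addr's line, allocating on a miss. */
    static unsigned lookup(uint32_t addr) {
        uint32_t index = (addr / LINE_BYTES) % SETS;  /* 7 index bits */
        uint32_t tag   = addr / LINE_BYTES / SETS;    /* the rest     */

        for (unsigned w = 0; w < WAYS; w++)       /* compare all 4 ways */
            if (set[index][w].valid && set[index][w].tag == tag)
                return w;                         /* hit */

        for (unsigned w = 0; w < WAYS; w++)       /* miss: any empty way? */
            if (!set[index][w].valid) {
                set[index][w] = (struct way){ 1, tag };
                return w;
            }

        unsigned w = victim[index];               /* all full: evict one */
        victim[index] = (w + 1) % WAYS;
        set[index][w] = (struct way){ 1, tag };
        return w;
    }

    int main(void) {
        /* two addresses that collide on index land in two different ways */
        return lookup(0x00045600u) != lookup(0x00065600u) ? 0 : 1;
    }
    ```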

    This should all be in your textbook in some form using some language.

    A simplified example that doesn't cover the whole reality of the situation. You are answering phones for some company. The boss is crazy and dictates that you cannot have more than 16 messages/notes at any one time before you have to carry them to the employees. You take a call, write down the employee name and message, repeat. At or before you hit 16, you have to sneaker-net (walk) that message to the employee. If you get to 16, you cannot answer any more calls until you deliver at least one and free up a slot. Assume it is like the old days and the phone keeps ringing until you pick it up (you are the answering machine). If you get a second call for an employee before you walk the note to them, you CAN write that second message down on the same note; it does not count as two.

    The note is a cache line. The employee name is the tag, and the messages are the bytes in the cache line. And 16 is the total number of cache lines I can store in this cache. Your eyes are the logic that looks at the notes to see if there is a hit or miss, and likewise you determine evictions when the time comes. And you can deliver more than one note per trip from your desk.

    So you get to design your phone cache: do you always wait until you have 16 slips before you deliver one or some, hoping to make fewer trips if someone gets more than one message, or do you deliver them as soon as you get them? Or somewhere in between? You cannot predict when calls will come, so you cannot perfectly design the solution, but the goal is to improve the number of rings before someone (you) answers and also improve the delivery of the messages to the employees. Note that in this fantasy the employees don't care about the latency from the time the call happened to when they get the message, so long as all messages are eventually delivered. This doesn't cover the real memory cache problem and solution, but maybe it helps you think about it; maybe it makes your understanding worse, if so, sorry.

    In general, as a programmer, the cache is just part of the memory system that makes your computer work; you are not actually talking to it, you are generically using the memory space the processor allocated to your program to do stuff. Want to have a string and manipulate it? You do that. Your program itself is most likely acting through a cache, and normally you don't need to obsess about where the loops are or how they are aligned. And you probably are not thinking about the MMU either, which can and does make what you think is a linear address space given to you by the operating system actually be fragmented across physical memory, and then how those fragments hit the cache is not under your direct control. If you generically write a program to run on Windows or Linux or a Mac or iOS or Android, you have no idea what MMU/cache configuration sits behind your high-level program. Bare metal, sure, you are in control and you can turn the knobs you can turn (MMU, code and data alignment and striping). On some systems/chips you can read the dimensions of the cache; on some you can mess with the parameters. And in those situations you can have fun showing the cache is greatly helping and/or showing the cache is making the code run slower.

    Edit

    '67' would be the index

    I would say no to that, but if you think that is right, then fine. Which bits form the offset, index, and tag depends entirely on the cache geometry (the line size and the number of lines), not on the address itself.

    but where is the data?

    It is in the cache; search the web for the word "cache" or look it up in a dictionary. The cache holds the data.

    Is the address in question the data?

    The address in question becomes, in whole or in part, the tag.

    Is there more space in the cache that holds the data we need?

    The data is in the cache.

    Does it send you directly to that location in memory?

    If there is a miss, then the data is read from main memory and stored in the cache, and then at least the portion you asked for is sent to you.

    Also, assuming a 32-bit word length, do you send the address to the cache and get a full word back, or just the bits that are needed?

    That is determined by the bus design and implementation. It is not uncommon for reads to be a unit like that, 32 or 64 bits, with the processor extracting from those the byte or whatever it was looking for. This is NOT inefficient; it is often more efficient on modern processors to do as much as you can in these kinds of units. Using byte-sized variables can cost you performance; it technically might save you memory, but the 1970s ended a long time ago, you have plenty of memory. Normally writes are the ones where size matters: there is often either a byte mask indicating which bytes of the bus have real data that has to be stored, or a size, and the bus uses data bus bits 0 to N-1.
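    As a sketch of that extraction on the read side, assuming a little-endian bus and invented names:

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* The bus returns a whole 32-bit word; the core isolates the byte
       lane it actually asked for.  Little-endian lanes assumed. */
    static uint8_t byte_lane(uint32_t bus_word, uint32_t addr) {
        unsigned lane = addr & 3u;             /* which of the 4 byte lanes */
        return (uint8_t)(bus_word >> (8u * lane));
    }

    int main(void) {
        uint32_t word = 0xAABBCCDDu;           /* word read at 0x12345670 */
        printf("0x%02X\n", byte_lane(word, 0x12345672u)); /* lane 2: 0xBB */
        return 0;
    }
    ```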

    PS: It is most helpful for me to learn from real examples, so if you have resources on practice for this (and MIPS programming, particularly non-leaf functions), I would really appreciate that as well.

    Opencores.org has a few processors with caches in them.