Ok, some background on the what and the why.
I want to compile and run microcontroller firmware (bare metal, no OS) on desktop Linux. I don't want to write a bytecode interpreter or binary translator; I want to compile the original source. Running the FW as a standard GUI application has a lot of advantages: quick dev iterations, advanced debugging, automated testing, stress testing, etc. I've done this before with AVR microcontrollers for a few projects, and usually took the following steps:
The first three steps are easy (and not much code for AVRs), the last one is tricky. Some constructs in the FW end up as infinite loops in the desktop version (e.g. a busy loop waiting on a peripheral register change, or on memory changed by an interrupt handler), others end up as no-ops (writes into MMIO which on the real system trigger something), and fusing the FW's main loop with the GUI library's main loop needs some creativity too. If the FW is nicely layered, the low-level code can be substituted with glue functions without too much hacking.
Though the overall behavior is impacted by these changes, I found the end result very useful in many cases. Unfortunately the method is intrusive (it modifies the FW), and the glue logic depends heavily on the architecture of the FW, so it needs to be reinvented every time.
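To make the glue-function idea concrete: if the FW goes through a thin HAL-like layer, the desktop build just swaps the implementation behind the same prototype. A minimal sketch; TARGET_AVR and hal_uart_putc are made-up names, the AVR register names are just the usual UART example:

#ifdef TARGET_AVR                          /* target build: touch the real peripheral */
#include <avr/io.h>
void hal_uart_putc(char c)
{
    while (!(UCSR0A & (1 << UDRE0))) {}    /* busy-wait until the transmit buffer is free */
    UDR0 = c;
}
#else                                      /* desktop build: the "peripheral" is the host */
#include <stdio.h>
void hal_uart_putc(char c)
{
    fputc(c, stdout);                      /* or push the byte into the GUI's serial-console widget */
}
#endif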
Getting closer to the question...
From a C/C++ point of view, the most important difference between the FW and code running on a proper OS is MMIO. MMIO accesses have side effects, and different side effects for reads and writes. In a desktop app this concept does not exist (unless you poke HW from userspace). If it were possible to define a hook that runs when a memory location is read or written, proper peripheral emulation would be possible and the FW could be compiled mostly intact. Of course this cannot be done in plain C++; the whole point of a native language works against it. But the same concept (tracking memory accesses at runtime) is used by memory debuggers, with the help of instrumentation.
I have a few ideas on implementation, so my question is: how feasible do you think they are, and is there any other way to achieve the same result?
Option 1: no instrumentation at all. x86 can raise a debug exception when a given memory location is accessed; debuggers use this to implement watchpoints (break on memory access). As a proof of concept I created this test program:
#include <stdio.h>

volatile int UDR; /* stands in for the MMIO register I want to track */

/* hooks that GDB will call when UDR is accessed (see the script below) */
void read()  { printf("UDR read\n");  }
void write() { printf("UDR write\n"); }

int main()
{
    UDR = 1;             /* should fire the write hook */
    printf("%i\n", UDR); /* should fire the read hook */
    return 0;
}
UDR is the MMIO register I want to track. If I run the compiled program under GDB with the following script:
# write watchpoint: run the write hook and keep going
watch UDR
commands
  call write()
  cont
end

# read watchpoint: run the read hook and keep going
rwatch UDR
commands
  call read()
  cont
end
The result is exactly what I want:
UDR write
UDR read
1
The problem is that I don't know if this is scalable at all. As far as I know watchpoints are a limited hardware resource, but I couldn't find out the limit on x86; I would probably need fewer than 100. GDB also supports software watchpoints, but only for writes, so they are not really usable for this purpose. Another downside is that the code would only run under a GDB session.
Option 2: runtime instrumentation. If I'm correct, Valgrind/libVEX does this: it reads the compiled binary and inserts instrumentation code at memory access locations (among many other things). I could write a new Valgrind tool that is configured with addresses and callbacks like the GDB script above, and execute the application in a Valgrind session. Do you think this is feasible? I found some documentation on creating new tools, but it doesn't seem to be an easy ride.
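To clarify what I'm after, the tool-side handlers I'd want to end up with would look roughly like this. None of this is actual Valgrind API, just the shape of the callbacks; the real work is wiring them into the VEX IR the way the example tools like lackey do:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uintptr_t addr;                                /* emulated register address */
    void (*on_read)(uintptr_t);
    void (*on_write)(uintptr_t, uint64_t);
} mmio_hook;

static void udr_read(uintptr_t a)              { (void)a; printf("UDR read\n"); }
static void udr_write(uintptr_t a, uint64_t v) { (void)a; printf("UDR write %llu\n", (unsigned long long)v); }

static const mmio_hook hooks[] = {
    { 0x20004000u, udr_read, udr_write },          /* made-up address */
};

/* Hypothetical entry points the instrumented loads/stores would call (names are mine). */
void trace_load(uintptr_t addr)
{
    for (size_t i = 0; i < sizeof hooks / sizeof hooks[0]; i++)
        if (hooks[i].addr == addr && hooks[i].on_read)
            hooks[i].on_read(addr);
}

void trace_store(uintptr_t addr, uint64_t value)
{
    for (size_t i = 0; i < sizeof hooks / sizeof hooks[0]; i++)
        if (hooks[i].addr == addr && hooks[i].on_write)
            hooks[i].on_write(addr, value);
}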
Option 3: compile-time instrumentation. The memory and address sanitizers in clang and gcc work this way. It's a two-part game: the compiler emits instrumented code, and a sanitizer runtime library (implementing the actual checks) is linked into the application. My idea is to replace the sanitizer library with my own implementation that performs the above callbacks, without any compiler modification (which is probably beyond my capabilities). Unfortunately I didn't find much documentation on how the instrumented code and the sanitizer library interact; I only found papers describing the checker algorithms.
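For example (and I may well be misunderstanding the interface): if the compiler can be convinced to emit out-of-line checks (clang apparently has something like -mllvm -asan-instrumentation-with-call-threshold=0 for that), the instrumented code seems to call functions named like __asan_load4/__asan_store4 before each access, so my replacement "runtime" could look roughly like this. Unverified sketch; a real build would also need stubs for all the other __asan_* symbols the compiler references, and the address range is made up:

#include <stdint.h>
#include <stdio.h>

/* Made-up address range standing in for the emulated peripheral block. */
static const uintptr_t MMIO_BASE = 0x40000000u;
static const uintptr_t MMIO_END  = 0x40001000u;

static int is_mmio(uintptr_t addr) { return addr >= MMIO_BASE && addr < MMIO_END; }

/* Called by the instrumented code before 4-byte loads/stores (if I read the interface right).
   Compile as C, or wrap in extern "C" in C++, so the symbol names stay unmangled. */
void __asan_load4(uintptr_t addr)
{
    if (is_mmio(addr))
        printf("MMIO read  @ 0x%lx\n", (unsigned long)addr);   /* peripheral model hook goes here */
}

void __asan_store4(uintptr_t addr)
{
    if (is_mmio(addr))
        printf("MMIO write @ 0x%lx\n", (unsigned long)addr);
}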
So that's all for my problem, any comment on any topic is appreciated. :)
I haven't got time to reply to ALL of the questions in your question, but this is probably going to be too long to be a comment...
So, regarding "watchpoints" in the debugger: they use the CPU's debug registers, and whilst you can write code to use those registers yourself (there are API functions to do that - you need to be in kernel mode to actually write the registers), as you state yourself, you will run out of them. The number is MUCH lower than your 100 as well: x86 processors have 4 debug address registers, each covering reads and/or writes to a 1-, 2-, 4- or 8-byte wide location. So it would only work if you have a total of at most 32 bytes of IO space, spread over no more than 4 chunks of no more than 8 bytes each.
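If you want to play with it anyway: on Linux, I believe the user-space route to those registers is ptrace poking the u_debugreg fields of a traced child (the kernel does the privileged write for you). A minimal, untested sketch, x86-64 assumed:

#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <stddef.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

volatile int UDR;                          /* the "register" to watch */

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                        /* child: plays the role of the firmware */
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        raise(SIGSTOP);                    /* wait until the parent has armed DR0/DR7 */
        UDR = 1;                           /* should trap here (write)... */
        int tmp = UDR;                     /* ...and here (read) */
        (void)tmp;
        return 0;
    }

    int status;
    waitpid(pid, &status, 0);              /* child has stopped itself */

    /* DR0 = address to watch; DR7: bit 0 enables DR0, bits 16-17 = 11b (break on
       read/write), bits 18-19 = 11b (4-byte wide watchpoint). */
    ptrace(PTRACE_POKEUSER, pid, offsetof(struct user, u_debugreg[0]), &UDR);
    unsigned long dr7 = 1UL | (3UL << 16) | (3UL << 18);
    ptrace(PTRACE_POKEUSER, pid, offsetof(struct user, u_debugreg[7]), dr7);

    ptrace(PTRACE_CONT, pid, 0, 0);
    while (waitpid(pid, &status, 0) > 0 && WIFSTOPPED(status)) {
        printf("watchpoint hit\n");        /* a real emulator would dispatch on the address here */
        ptrace(PTRACE_CONT, pid, 0, 0);
    }
    return 0;
}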
Option 2 has the problem that you need to guarantee that the address region used by your IO registers is not used for something else in your application. This may be "easy" if all the IO registers are, say, in the first 64KB; otherwise you have to figure out somehow whether a given access is MMIO or regular memory. Also, writing your own Valgrind tool is not something you get done in an instant... even if you hire the guy that wrote Valgrind in the first place...
Option 3 has the same problem as option 2 with regard to matching addresses. My feeling is that this won't help you that much, and you are better off approaching it in a different way.
The approach that I have seen in various chip simulators that I have used is to modify the access to the real hardware into a function call. You could do that in C++ by something like the method MSalters describes.
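I won't reproduce MSalters' code here, but the gist (my own rough sketch of that style, not his actual code) is a proxy type with overloaded assignment and conversion operators, so the firmware source keeps saying UDR = 1 while the accesses go through hooks:

#include <stdio.h>

class MmioReg {
public:
    MmioReg &operator=(int value)          /* intercepts writes: UDR = 1; */
    {
        printf("UDR write %d\n", value);   /* peripheral model hook */
        raw = value;
        return *this;
    }
    operator int() const                   /* intercepts reads: int x = UDR; */
    {
        printf("UDR read\n");
        return raw;
    }
private:
    int raw = 0;
};

MmioReg UDR;                               /* drop-in replacement for the volatile register */

int main()
{
    UDR = 1;
    printf("%i\n", (int)UDR);
    return 0;
}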
Or by modifying your code, such that you do:

MMIO_WRITE(UDR, 1);

and then let MMIO_WRITE translate to:

#if REAL_HW
#define MMIO_WRITE(x, y) x = y
#else
#define MMIO_WRITE(x, y) do_mmio_write(x, y)
#endif

where do_mmio_write is able to understand addresses and what they do in some way.
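Typically that ends up as a switch or table keyed on the register address; something like this (addresses and peripherals invented for the example):

#include <stdint.h>
#include <stdio.h>

#define REG_UART_DATA  0x40000000u         /* example register addresses of the simulated chip */
#define REG_TIMER_CTRL 0x40000010u

void do_mmio_write(uintptr_t addr, uint32_t value)
{
    switch (addr) {
    case REG_UART_DATA:
        fputc((int)(value & 0xff), stdout);    /* model the UART by printing the byte */
        break;
    case REG_TIMER_CTRL:
        /* start/stop the simulated timer here */
        break;
    default:
        fprintf(stderr, "unmodelled MMIO write @ 0x%lx\n", (unsigned long)addr);
        break;
    }
}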
This is certainly how the GPU model I use at work (for modelling the latest and greatest GPU we're about to make into silicon) works, and it was also the approach used at the previous company I worked for that had such a model.
And yes, you will have to rewrite some of your code - ideally your code is written such that you have specific small sections of code that touch the actual hardware [this is certainly good practice if you ever want to move from one type of microcontroller to another, since you would otherwise have to do a lot more rewriting in such a case too].
As Martin James points out, the problem with any such simulation is that if your simulator isn't VERY good, you run into "compatibility problems" - in particular hardware vs. software race conditions. Your software is perfectly synchronous to the simulated hardware model, but the real hardware does things asynchronously to the software, so two reads of two registers can return different values than in the software model, because some arbitrary thing changed in the real hardware that your software model didn't take into account. And now you have one of those nasty bugs that only occurs once in a blue moon, and only on the "can't debug" hardware variant, never in the software model.