I have a function that uses the compiler intrinsic __movsq to copy some data from one global buffer into another global buffer on every call of the function. I'm trying to nop out those instructions once a flag has been set globally and the same function is called again. Example code:
// compiler: MSVC++ VS 2022 in C++ mode; x64
void DispatchOptimizationLoop()
{
    __movsq(g_table, g_localimport, 23);
    // hopefully create a nop after movsq?
    static unsigned char* ptr = (unsigned char*)(&__nop);
    if (!InterlockedExchange8(g_Reduce, 1))
    {
        // point to movsq in memory
        ptr -= 3;
        // nop it out
        ...
    }
    // rest of function here
    ...
}
Basically the function places a nop after the movsq, then tries to get the address of that nop and backtrack by the size of the movsq so that a pointer points to the start of the movsq; then I can simply cover it with three 0x90s. I am aware that the line (unsigned char*)(&__nop) is not actually creating a nop because I'm not calling the intrinsic; I'm just trying to show what I want to do.
Is this possible, or is there a better way to store the address of the instructions that need to be nop'ed out in the future?
It's not useful to have the address of a 0x90 NOP somewhere else; all you need is the address of the machine code inside your function. Nothing you've written comes remotely close to helping you find that. As you say, &__nop doesn't lead to there being a NOP in your function's machine code which you could offset relative to.
If you want to hard-code offsets that could break with different optimization settings, you could take the address of the start of the function and offset it.
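If you do go the hard-coded-offset route, it seems worth at least verifying the bytes at that offset before patching. Here is a minimal sketch, assuming the compiler emitted the intrinsic as the 3-byte rep movsq encoding F3 48 A5 (that is how the instruction assembles on x86-64). The names kRepMovsq, has_rep_movsq_at and find_rep_movsq are hypothetical helpers, not anything MSVC provides.

```cpp
#include <cstddef>
#include <cstring>

// Machine-code bytes of `rep movsq` on x86-64: REP prefix, REX.W, MOVS opcode.
static const unsigned char kRepMovsq[3] = {0xF3, 0x48, 0xA5};

// Returns true if the 3 bytes at code[offset] are exactly `rep movsq`.
// `code` would be the function's start address viewed as bytes;
// `len` is how many bytes the caller is willing to read.
bool has_rep_movsq_at(const unsigned char* code, std::size_t len, std::size_t offset)
{
    if (offset > len || len - offset < sizeof kRepMovsq)
        return false;
    return std::memcmp(code + offset, kRepMovsq, sizeof kRepMovsq) == 0;
}

// Scans for the first occurrence; returns its offset, or `len` if not found.
// A byte scan can also match inside another instruction's operands, so treat
// a hit as a hint to confirm in a disassembler, not as proof.
std::size_t find_rep_movsq(const unsigned char* code, std::size_t len)
{
    for (std::size_t i = 0; i + sizeof kRepMovsq <= len; ++i)
        if (std::memcmp(code + i, kRepMovsq, sizeof kRepMovsq) == 0)
            return i;
    return len;
}
```

In the real program, code would come from something like (const unsigned char*)&DispatchOptimizationLoop. Be aware that with incremental linking the function's address can point at a jump thunk rather than the function body, which is one more way a hard-coded offset can break.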
Or you could write the whole function in asm so you can put a label on the address you want to modify. That would actually let you do this safely.
You might get something that happens to work with GNU C labels as values, where you can take the address of C goto labels like &&label. Put a mylabel: before the intrinsic, and maybe a mylabel_end: after it for good measure so you can check that the difference is the expected 3 bytes. If you're lucky, the compiler won't put any other instructions between your labels. Then you can memset((void*)&&mylabel, 0x90, 3) (after an assert on &&mylabel_end - &&mylabel == 3). But I don't think MSVC supports that GNU extension or anything equivalent.
And for efficiency, you want a single 3-byte NOP (0F 1F 00) anyway, rather than three separate 0x90 bytes.
And of course you'd have to VirtualProtect the page of machine code containing that instruction to make it writeable. (Assuming the function is 16-byte aligned, it's hopefully impossible for that one instruction near the start to be split across two pages.)
And if other threads could be running this function at the same time, you'd better use an atomic RMW (on the containing dword or qword) to replace the 3-byte instruction with a single 3-byte NOP. Otherwise another thread could fetch and decode the first NOP, but then fetch a byte of the movsq machine code that hasn't been replaced yet.
Actually a plain mov store would be atomic if it's 4 bytes not crossing an 8-byte boundary. Since there are no other writers of different data, it's fine to load / AND/OR / store to later store the same surrounding bytes you loaded earlier. Normally a non-atomic load+store is not thread-safe, but no other threads could have written a different value in the meantime.
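The load / modify / store idea can be sketched as follows. This is only an illustration under stated assumptions: it uses the 3-byte long NOP 0F 1F 00 (nop dword ptr [rax]), it runs on an ordinary byte buffer rather than live code, it assumes a little-endian host (as x86-64 is), and patched_dword, within_aligned_qword and patch_in_buffer are hypothetical helper names.

```cpp
#include <cstdint>
#include <cstring>

// True if the 4 bytes starting at `addr` lie inside one aligned 8-byte block,
// the condition under which a plain 4-byte store is assumed atomic here.
bool within_aligned_qword(std::uintptr_t addr)
{
    return (addr & 7) <= 4;
}

// Given the 4 bytes currently at the patch site (read as a little-endian
// dword), returns the value to store back: bytes 0..2 become the 3-byte NOP
// 0F 1F 00, and byte 3 (the start of the next instruction) is preserved.
std::uint32_t patched_dword(std::uint32_t old_bytes)
{
    const std::uint32_t nop3 = 0x00001F0Fu;   // memory order: 0F 1F 00
    return (old_bytes & 0xFF000000u) | nop3;
}

// Demonstration on a data buffer: `site` points at the 3-byte instruction.
void patch_in_buffer(unsigned char* site)
{
    std::uint32_t v;
    std::memcpy(&v, site, sizeof v);   // load the surrounding 4 bytes
    v = patched_dword(v);              // swap in the NOP, keep the 4th byte
    std::memcpy(site, &v, sizeof v);   // in real code: a single 4-byte mov store
}
```

On live code the page would have to be made writable first, and the final write would need to compile to one 4-byte mov store, which memcpy does not guarantee; checking the generated asm is the only way to be sure.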
I think cross-modifying code has atomicity rules similar to data. But if the instruction spans a 16-byte boundary, code-fetch in another core might have pulled in the first 1 or 2 bytes of it before you atomically replace all 3. So the 2nd and 3rd byte get treated as either the start of an instruction, or the 2nd + 3rd bytes of a long-NOP. Since long-NOPs start with 0F 1F (0F being an escape byte), if that's not how the __movsq machine code starts then decoding could desync.
So if cross-modifying code doesn't trigger a pipeline nuke on the other core, it's not safe to do it while another thread might be running the code. Code fetch is usually done in 16-byte chunks but that's not guaranteed. And it's not guaranteed that they're aligned 16-byte chunks.
So you should probably make sure no other threads are running this function while you change the machine code. Unless you're very sure of the safety of what you're doing and check each build to make sure the instruction starts at a safe offset, where "safe" means that no boundary at which another core's code fetch could split the instruction falls inside those 3 bytes.
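One way to make that per-build check mechanical is a small helper that tests whether the instruction's bytes straddle a given power-of-two boundary. This is a sketch, not a guarantee of safety: the 16-byte figure is just the typical fetch-chunk size mentioned above, and straddles and looks_safe_to_patch are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// True if the byte range [addr, addr + len) crosses a multiple of `boundary`.
// `boundary` must be a power of two and `len` at least 1.
bool straddles(std::uintptr_t addr, std::size_t len, std::uintptr_t boundary)
{
    const std::uintptr_t mask = ~(boundary - 1);
    return (addr & mask) != ((addr + len - 1) & mask);
}

// Conservative check for a 3-byte instruction at `addr`: it must not cross an
// aligned 16-byte block (the assumed fetch granularity). Staying inside one
// 16-byte block also means it cannot cross a page boundary.
bool looks_safe_to_patch(std::uintptr_t addr)
{
    return !straddles(addr, 3, 16);
}
```

Something like this could run once at startup, before patching, and fall back to leaving the copy in place if the check fails.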