x86_64 - Self-modifying code performance

I am reading the Intel architecture documentation, vol. 3, section 8.1.3:

Self-modifying code will execute at a lower level of performance than non-self-modifying or normal code. The degree of the performance deterioration will depend upon the frequency of modification and specific characteristics of the code.

So, if I respect the rules:

(* OPTION 1 *)
Store modified code (as data) into code segment;
Jump to new code or an intermediate location;
Execute new code;

(* OPTION 2 *)
Store modified code (as data) into code segment;
Execute a serializing instruction; (* For example, CPUID instruction *)
Execute new code;

and modify the code only once a week, I should pay the penalty only the first time the code is executed after each modification. After that, the performance should be the same as for non-modified code (plus the cost of a jump to that code).

Is my understanding correct?

Solution

There's a difference between code that simply isn't cached yet and code that modifies instructions which are already speculatively in flight (fetched, maybe decoded, maybe even sitting in the scheduler and re-order buffer of the out-of-order core). Writes to memory that the CPU is already treating as instructions force it to flush the pipeline and fall back to very slow operation. This is what's usually meant by self-modifying code. Avoiding this slowdown is not hard, even when JIT-compiling: just don't jump to your buffer until after it has been completely written.
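As a minimal sketch of "write the whole buffer first, then jump to it", assuming x86-64 Linux with POSIX mmap/mprotect and a 4096-byte page: the names jit_fn, jit_build, jit_demo and code_return_42 are hypothetical and exist only for this illustration.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define JIT_PAGE_SIZE 4096  /* assumption: one page is enough for the demo */

/* Hypothetical type for a generated function that takes no arguments. */
typedef int (*jit_fn)(void);

/* x86-64 machine code for: mov eax, 42 ; ret */
const unsigned char code_return_42[] = {
    0xB8, 0x2A, 0x00, 0x00, 0x00,  /* mov eax, 42 */
    0xC3                           /* ret */
};

/* Copy the finished code into a fresh buffer and only then make it
 * executable. Nothing can be fetching from the buffer while it is being
 * written, because no jump to it has happened yet.
 * Returns NULL on failure. */
jit_fn jit_build(const unsigned char *code, size_t len)
{
    void *buf;

    if (len > JIT_PAGE_SIZE)
        return NULL;

    buf = mmap(NULL, JIT_PAGE_SIZE, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    memcpy(buf, code, len);                 /* store the code as data */

    if (mprotect(buf, JIT_PAGE_SIZE, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, JIT_PAGE_SIZE);
        return NULL;
    }
    return (jit_fn)buf;                     /* caller may now jump to it */
}

/* Build the function, call it once, release the buffer.
 * Returns the generated function's result, or -1 on failure. */
int jit_demo(void)
{
    jit_fn fn = jit_build(code_return_42, sizeof code_return_42);
    int result;

    if (fn == NULL)
        return -1;
    result = fn();                          /* the call is the "jump to new code" */
    munmap((void *)fn, JIT_PAGE_SIZE);
    return result;
}
```

Calling jit_demo() should return the value produced by the generated code. The write-then-mprotect order is a design choice for the sketch (it also avoids a page that is writable and executable at once); the point that matters for performance is only that every store to the buffer finishes before the first jump into it.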

Modified once a week means you might have a one microsecond penalty once a week, if you do it wrong. It's true that frequently-used data is less likely to be evicted from the cache (that's why reading something multiple times is more likely to make it "stick"), but the self-modifying-code pipeline flush should only apply the very first time, if you encounter it at all. After that, the cache lines being executed are probably still hot in the L1 I-cache (and the uop cache) if the second run happens without much intervening code. They are no longer in a modified state in the L1 D-cache.
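The "modify rarely, run often" pattern could be sketched as follows, again assuming x86-64 Linux with POSIX mmap/mprotect and a 4096-byte page. The names gen_fn, emit_return_const and patch_and_call are hypothetical. The code is patched once while nothing is executing it, and every later call runs the already-patched bytes.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std= modes */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PATCH_PAGE_SIZE 4096  /* assumption: one page holds the demo code */

/* Hypothetical type for the generated function. */
typedef int (*gen_fn)(void);

/* Write "mov eax, imm32 ; ret" (6 bytes) at buf. */
void emit_return_const(unsigned char *buf, int32_t value)
{
    buf[0] = 0xB8;                          /* mov eax, imm32 */
    memcpy(buf + 1, &value, sizeof value);  /* little-endian immediate */
    buf[5] = 0xC3;                          /* ret */
}

/* Generate code returning old_value, run it once, patch it to return
 * new_value, then call it `repeat` times.
 * Returns the last result (old_value if repeat is 0), or -1 on failure. */
int patch_and_call(int32_t old_value, int32_t new_value, int repeat)
{
    unsigned char *buf;
    gen_fn fn;
    int result = -1;
    int i;

    buf = mmap(NULL, PATCH_PAGE_SIZE, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    fn = (gen_fn)(void *)buf;

    emit_return_const(buf, old_value);
    if (mprotect(buf, PATCH_PAGE_SIZE, PROT_READ | PROT_EXEC) != 0)
        goto fail;
    result = fn();                          /* old code has now been executed */
    if (result != old_value)
        goto fail;

    /* The rare modification: make the page writable, store the new code
     * completely, make it executable again, and only then call it. */
    if (mprotect(buf, PATCH_PAGE_SIZE, PROT_READ | PROT_WRITE) != 0)
        goto fail;
    emit_return_const(buf, new_value);
    if (mprotect(buf, PATCH_PAGE_SIZE, PROT_READ | PROT_EXEC) != 0)
        goto fail;

    /* Any one-time cost is paid on the first call after the patch;
     * the remaining calls execute unmodified code. */
    for (i = 0; i < repeat; i++) {
        result = fn();
        if (result != new_value)
            goto fail;
    }
    munmap(buf, PATCH_PAGE_SIZE);
    return result;

fail:
    munmap(buf, PATCH_PAGE_SIZE);
    return -1;
}
```

For example, patch_and_call(1, 7, 1000) runs the patched function 1000 times and returns its last result. The sketch checks correctness only; it does not measure the first-call penalty, which would need a cycle counter and is likely too small to observe reliably at this scale.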

I forget if http://agner.org/optimize/ talks about self-modifying code and JIT. Even if not, you should read Agner's guides if you're writing anything in asm. Some of the material in the main "optimizing asm" guide is getting out of date and is not really relevant for Sandybridge and later Intel CPUs, though. Alignment and decode bottlenecks matter less thanks to the uop cache, and the alignment issues that remain can be different on SnB-family microarchitectures.