Search code examples
optimizationmemorydata-modelinglow-level

Optimizing low-level structure for CPU memory performance in my game simulator


I am writing a world simulator for a turn-based game. The key feature is that there will be many NPCs with detailed psychology modelling, perhaps 20,000 - 30,000 of them. I want all NPCs to make their plays in a turn in <1 sec total, and the game should fit in 1-2 GB of memory in total. To this end, I am writing it in Rust.

I have a key struct that is called "Character" and represents an NPC character in the game. As part of the turn, a given CPU core is going to do extensive calculations on each Character in a batch mode before moving to the next Character. During this process, the CPU core will do many random accesses to many different segments of the Character data, which can't be optimized in order due to the complexity of the game simulation logic. Since the game will go through all characters each turn, miliseconds count in terms of execution.

I am primarily wondering if there is a memory size breakpoint after which memory access to each Character will become inefficient in terms of cache sizes, OS memory pages, etc. Right now each Character record is just under 32kb in size. So is there going to be any extra inefficiency in memory access, if it goes to, say, 48kb?

I am currently implementing quite a few measures to keep this Character structure small and tight in memory.

  • Implementing parts of the data as inline arrays so that everything fits in memory.
  • Implementing tight data structures such as bit fields, etc.
  • Implementing 2 different levels of detail ("zoomed in" characters and rest), and moving extra data for "zoomed in" characters to a Box

The main question I am asking myself is whether this effort could have been better spent in developing game logic instead.


Solution

  • the CPU core will do many random accesses to many different segments of the Character data

    Be aware that random access are generally inefficient, especially because they cannot be predicted by the CPU. They are not so bad when data in the same cache line are also read/written (spatial locality) or when the same data are read/written later (temporal locality).

    To understand how bad random accesses are, you should keep in mind that the latency of DRAM accesses is generally >50 ns (>200 cycles at 4 GHz) and a cache line is generally 64 bytes. Thus, requesting an integer of 4 bytes can result in huge CPU stalls and the major part of the (precious) DRAM bandwidth being wasted (>93% here).

    In practice, modern CPUs are designed to avoid stalls like this thanks to out-of-order execution (and instruction-level parallelism): the CPU just executes other independent instructions so to mitigate the latency. This assume the code is written in a way that the CPU can do that. OOP code with virtual calls are really bad for that, and the same is true for one with a lot of conditions or chain of memory accesses spread everywhere.


    I am primarily wondering if there is a memory size breakpoint after which memory access to each Character will become inefficient in terms of cache sizes, OS memory pages, etc.

    Well, mostly yes. But there is no one standard limit. This is highly dependent of the target platform and more specifically the exact CPU architecture (and even the microcode version), the exact DRAM specification, the exact OS (including its exact version). This is a very complex topic. Generally, the important point to keep in mind is that the bigger the stride the higher the latency and so the slower the code (doing random accesses).

    For example, if you read a value in memory and then read another value in the same cache line, the second fetch is often considered as a free one. But the truth is that it depends of the CPU architecture. Based on experimental tests, this is rather true on Intel Skylake-like CPUs (and probably most recent Intel CPUs), but it is not so cheap on AMD Zen-like CPUs when multiple memory accesses are done concurrently (the second access can prevent fetching the subsequent expensive random accesses decreasing performance). AFAIK, this is also true on the Apple M-like CPUs. This is because of a limited concurrency in each core (for example, on Intel Skylake, the LFB units is limited to 12 entries which is very small for accesses lasting >50 ns).

    Note some CPU can fetch 2 cache lines at once so to mitigate the huge latency (AFAIK some Intel CPUs do that) and hide it with memory transfers.

    Then comes the cache structure. For example, accesses with a stride being a power of two are generally a bad idea because they tends to results in cache trashing and more specifically conflicting accesses to the same cache line while having different addresses. This is especially true with power-of-two cache size. Some relatively-recent CPU have non-power-of-two size (eg 48 KiB L1) which help a bit but this is still a bad idea overall (because 48 KiB = 16 KiB * 3). See this post for more information about this point and more generally the performance of stridded accesses.

    Then comes paging and virtual memory. For example, if you do random access to too many pages, then it will results in TLB misses. Intel CPU have 2-level LTB (DTLB and STLB). Like for basic caches, DTLB misses are less expensive than STLB ones. Page walks are particularly expensive. Huge pages can help to reduce this overhead but be aware that the number of entry in the TLB is dependent of the CPU architecture and the size of the pages fetched (there are generally less entries for very huge pages).

    Then comes the DRAM. This is clearly the most complicated part. The thing to keep in mind is that the speed of DRAM accesses are dependent of the sequence of the actual addresses fetched (so the order matter), the exact timings of the DRAMs as well as its exact software configuration (see in the BIOS), etc. despite the "random-access" in the name. In practice, mainstream DRAMs are divided in banks and structured as arrays with many lines and columns. If you accesses data to the same banks, then the DRAM cannot really hide this latency (this is the main purpose of banks). Some CPU architecture (AFAIK most mainstream Intel CPUs) can shuffle the addresses so to mitigate this issue. As for the arrays: accessing to different lines introduce some additional delays resulting in a higher latency. The timings can be very different from one DRAM to another, not to mention they can be multiple channel and interleaving enabled impacting the thresholds. The generation of DDR-SDDRAM can typically impact the number of banks, the DRAM frequency and the delays (though the CAS latency tends to be limited nowadays). I strongly advise you to read What Every Programmer Should Know About Memory about this topic. You can then read How much of ‘What Every Programmer Should Know About Memory’ is still valid? so to be up to date ;) .

    Note there are also few other points to consider like NUMA effects for example, especially on server-side CPUs or AMD CPU with multiple CCD.


    I am currently implementing quite a few measures to keep this Character structure small and tight in memory.

    Reducing the size is a good idea but you can also reorder the data structure so to split the object in different memory area. You can for example store critical common information (headers) in an array and large blobs in a dedicated space. The idea is to minimize the number of cache misses. You can also put the fields you often access close in memory (dependent of the algorithm so it can quickly becomes problematic in a large game with many different computations).

    I also strongly advise you to read more about Data-Oriented Design (which can strongly help to improve performance in such case):

    Besides, note the AAA game industry moved away from object-oriented architectures (which are inherently inefficient for many reasons including cache misses due to random accesses, bad memory-layout, virtual calls, sometimes encapsulation, etc.) to data-oriented ones combined with Entity Component System (ECS). Your game might benefit from it.

    Last but not least, you can prefetch data when you applied all the above things and there is nothing else to do. However, keep in mind that prefetching is brittle, dependent of the target platform and mainly useful when the hardware prefetching units cannot help much (i.e. random accesses).