Tags: c, performance, memory-management, cpu-usage, cpu-speed

CPU efficient algorithms?


I do a fair bit of dabbling in performance-focused programming. Typically, most of the techniques I have been taught and am aware of pertain to conserving RAM.

That being said, I was recently addressing the question here: Memory efficient AI objects for physics game

Where I was told:

it is usually the CPU that runs out of speed before memory is exhausted

We did some testing and it seemed packing/unpacking does conserve RAM, but definitely slows down performance.

But as I have said, most of the typical performance 'rules' that I've seen deal with conserving RAM.

One of the major topics in program speed, for example, is dynamic memory allocation, which is also focused on RAM conservation.

What I want to know is: What makes code CPU efficient? Do lower-level languages like C have more flexibility for CPU efficiency? If so, why/how?

For the sake of simplicity, let's exclude discussion of assembly languages, as they are beyond the scope of this question.


Solution

  • Profiler

    First things first, when you go beyond glaring algorithmic inefficiencies, you want to find yourself a nice profiler. A profiler has several benefits:

    1. Shows you precise measurements (where the time is spent, cache misses, branch mispredictions, etc).
    2. Chasing down your top hotspots will tend to rapidly accelerate your learning process and intuition of what causes bottlenecks at a micro level (ex: memory hierarchy and branching).
    3. Makes you a better prioritizer. It'll also teach you what code doesn't need optimization, which means you can focus your attention there on other metrics like productivity (maintainability, safety, etc).

    For me, #2 was actually quite a big one. I didn't really start learning a lot of this stuff very rapidly until I had a profiler in hand. It's similar to how you can learn a lot about programming by working on an actual, sizable project and looking up the things that crop up along the way. In the same way, learning about micro-efficiency and computer architecture tends to be easier with a profiler in hand as you chase one hotspot after another and research why it exists.
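
    As a concrete (and hedged) starting point, assuming a Linux machine and a placeholder binary named ./my_app, the perf tool can record where time goes and report coarse hardware counters:

        # Sample call stacks while the program runs, then browse hotspots by function.
        perf record -g ./my_app
        perf report

        # Coarse counters for the cache misses and branch mispredictions mentioned above.
        perf stat -e cache-misses,branch-misses ./my_app

    Any decent profiler (VTune, Valgrind's cachegrind, Visual Studio's profiler, etc.) gives you the same kind of breakdown; the specific tool matters far less than actually measuring.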

    Memory Hierarchy

    With that aside, probably the number one thing beyond algorithmic complexity (which is about scalability rather than some absolute sense of performance) is memory efficiency.

    Caveat: this is going to be somewhat oversimplified and doesn't go into compiler design topics like register allocation and stack spills or even a very detailed description of the memory hierarchy.

    Machines and operating systems are built around a hierarchy of memory, ranging from the absolute fastest but smallest (registers) to the absolute slowest and biggest (disk).

    When accessing memory, the system loads it from slower memory into faster memory in largish, aligned chunks. For example, an operating system might page memory from a secondary storage device into physical memory (DRAM) in 4 kilobyte chunks.

    [4 kilobyte chunk][*4 kilobyte chunk][4 kilobyte chunk][...]
    // '*' indicates the chunk that's loaded in.
    

    When you request access to virtual memory anywhere within an aligned, 4-kilobyte chunk, the system will page that chunk into DRAM. But we're not done yet. Typically, before we can do anything, we must load from DRAM into the CPU cache, which is itself divided into a hierarchy. There, memory might be loaded in 64-byte aligned cache-line chunks, like so:

    [64-byte chunk][64-byte chunk][*64-byte chunk][...]
    

    ... so memory ends up moving from DRAM into the CPU cache that way. When you request access to memory within one of these 64-byte chunks, the whole 64-byte chunk is loaded into the CPU cache.

    And then the CPU cache itself is divided up into a hierarchy (though typically all using the same cache line size), and memory is moved down to faster but smaller CPU caches (the fastest being L1). And last but not least, before doing things like arithmetic, the memory from the L1 cache is loaded into a register, which might be, say, 64 bits in size for a general-purpose CPU register. In that case, we end up having our CPU cache memory laid out like this within a 64-byte cache line:

    [64 bits][64 bits][64 bits][*64 bits][64 bits][...]
    

    So finally, after working our way down to the smallest and fastest memory, we execute some arithmetic instructions on the registers and then typically move the results back up the hierarchy.

    Now this is somewhat of a gross simplification, and I might end up becoming embarrassed about it later. Yet the thing to keep in mind is that the CPU fetches memory from slower, bigger regions to faster, smaller regions in aligned chunks. It grabs memory by the contiguous handful. The hope for doing this is that you end up accessing that memory chunk multiple times (spatial/temporal locality) before it ends up getting evicted later.
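
    To make the "grabs memory by the contiguous handful" point concrete, here is a minimal C++ sketch (the matrix size and element type are just illustrative) showing how the same work can be cache-friendly or cache-hostile depending purely on access order:

        #include <cstddef>
        #include <vector>

        // Sum an n x n row-major matrix stored contiguously in one vector.
        double sum_row_major(const std::vector<double>& m, std::size_t n) {
            double total = 0.0;
            for (std::size_t row = 0; row < n; ++row)
                for (std::size_t col = 0; col < n; ++col)
                    total += m[row * n + col]; // consecutive addresses: each fetched cache line is fully used
            return total;
        }

        // Same matrix, same work, but striding down columns instead.
        double sum_column_major(const std::vector<double>& m, std::size_t n) {
            double total = 0.0;
            for (std::size_t col = 0; col < n; ++col)
                for (std::size_t row = 0; row < n; ++row)
                    total += m[row * n + col]; // jumps n*8 bytes each step: roughly one element used per fetched line
            return total;
        }

    Both functions touch every element exactly once, yet on large matrices the first tends to be dramatically faster, simply because it consumes each cache line before it gets evicted.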

    Memory Optimization

    Keeping this in mind, getting the most performance out of your code typically needs to start with prioritizing memory layout and access (besides algorithms and data structures, of course). Without efficient memory access, the fastest arithmetic instructions are hardly going to help.

    One of the things worth keeping in mind here is contiguous arrays. Data arranged contiguously, and accessed in a sequential pattern, is ideal for this kind of memory hierarchy. It's because the computer might grab a big old chunk of memory (page, cache line), then we sequentially plow through it and access the entire chunk while it's in a faster form of memory prior to eviction.

    Use the Data Before It's Evicted

    The worst-case scenario is when you end up loading a big old chunk of memory only to use a little piece of it, and then the system evicts it before you use the rest. Such scenarios can show up in linked structures like linked lists and trees (lacking a memory allocator to give them a more contiguous representation), where we might end up loading a chunk of memory for the region surrounding a node only to access that one node inside of it before the chunk is evicted.

    Another case where this shows up is in managed languages where each user-defined type has to be allocated separately (ex: through a garbage collector) but aggregated into an array-based list structure. In that case, even though we're storing an array of these objects, each object is actually represented through a reference (like a pointer) which points somewhere else in memory.

    That might be one of the most compelling reasons to use a language like C or C++. They allow user-defined types to be aggregated contiguously as well as allocated on the stack (which has a great deal of temporal locality).
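
    As a minimal sketch of that difference (the Particle type and its fields are made up purely for illustration): storing objects by value keeps them contiguous, while storing them behind per-object allocations reintroduces the scattered, pointer-chasing layout described above:

        #include <cstddef>
        #include <memory>
        #include <vector>

        struct Particle { float x, y, vx, vy; };

        // Contiguous: the particles sit side by side, so a sequential pass streams
        // through whole cache lines of useful data.
        std::vector<Particle> particles_by_value(std::size_t n) {
            return std::vector<Particle>(n);
        }

        // Indirect: the vector itself is contiguous, but it only holds pointers; each
        // Particle lives wherever the allocator put it, much like separately allocated
        // objects referenced from a list in a managed language.
        std::vector<std::unique_ptr<Particle>> particles_by_pointer(std::size_t n) {
            std::vector<std::unique_ptr<Particle>> out;
            out.reserve(n);
            for (std::size_t i = 0; i < n; ++i)
                out.push_back(std::make_unique<Particle>());
            return out;
        }

    The second version pays an extra indirection and, typically, an extra cache miss for every particle touched, which is exactly the penalty described above.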

    TL;DR

    If you want to learn more about these subjects, I'd suggest investigating locality of reference. This article is also a must: http://lwn.net/Articles/250967/

    Last but not least, if I'm allowed a shameless plug for a bounty question that I took lots of time to answer: What is the most efficient way to represent small values in a struct?.

    But anyway, first things first is to grab a profiler and start chasing down hotspots. It's the fastest way to learn, and the most productive way to optimize.

    Update

    The wise advice in Jenz's fine answer also gave me an urge to include a disclaimer: algorithmic efficiency still tends to be the first and foremost thing to worry about. I've worked with those types who talk about cache efficiency and multithreading all day while dealing with the most sub-optimal algorithms, and that's ineffective prioritization. As a blatant example, micro-optimizing or parallelizing a bubble sort of a million elements is far from effective.

    Where a lot of memory optimization techniques tend to help the most immediately is in those sequential cases where there is no choice but to touch every element (no way to get below linear complexity). Examples include a particle simulator which needs to process every particle, an image-processing algorithm which must affect every pixel, or matrix multiplication involving massive matrices. In those cases, there's no way to algorithmically skip a vast portion of the work and still get the same result, since we must process every element. For those times, memory optimization techniques can be even more effective than parallelization, and they also let you get more out of parallelization.
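
    For those sequential, touch-everything loops, laying the data out so that the hot loop only pulls in what it actually uses tends to pay off. A hedged sketch (the field names and the integration step are hypothetical):

        #include <cstddef>
        #include <vector>

        // Structure-of-arrays layout: the fields this loop needs are each stored
        // contiguously, so every byte fetched into the cache gets used.
        struct ParticlesSoA {
            std::vector<float> x, y;   // positions
            std::vector<float> vx, vy; // velocities
            // rarely-touched per-particle data (color, lifetime, ...) can live in other arrays
        };

        void integrate(ParticlesSoA& p, float dt) {
            const std::size_t n = p.x.size();
            for (std::size_t i = 0; i < n; ++i) {
                p.x[i] += p.vx[i] * dt;
                p.y[i] += p.vy[i] * dt;
            }
        }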

    Yet there is also a memory efficiency concern underlying the heart of data structures and algorithms. A quicksort of an array still tends to beat merge sort in practical scenarios purely because of memory efficiency. There are even cases where a linearithmic algorithm may beat a linear one provided that the former is vastly more memory efficient.

    Memory Allocators

    I mentioned the cache-unfriendliness of linked structures like trees and linked lists earlier, but that's assuming each node is allocated against a general-purpose allocator (and possibly not all at once). One of the things that can make even a structure like a singly-linked list a whole lot more applicable is the use of a memory allocator which gives its nodes the spatial locality they would normally lack. So there are ways to dig under your data structures and utilize memory allocators to make them more efficient that way, without actually switching to a whole new data structure.
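
    Here is a toy sketch of that idea (names and sizes are illustrative): a bump-style pool hands out nodes from one contiguous, pre-reserved block, so a list built from it keeps its nodes packed together instead of scattered across the heap. Note the simplification: nodes are never freed individually; the whole pool is released at once, which is often acceptable for bulk-built structures.

        #include <cstddef>
        #include <utility>
        #include <vector>

        // Toy bump allocator: all nodes come from one contiguous, pre-reserved block.
        // The capacity must not be exceeded, or the vector would reallocate and
        // invalidate previously handed-out pointers.
        template <typename Node>
        class NodePool {
        public:
            explicit NodePool(std::size_t capacity) { storage_.reserve(capacity); }

            template <typename... Args>
            Node* allocate(Args&&... args) {
                storage_.emplace_back(std::forward<Args>(args)...);
                return &storage_.back();
            }

        private:
            std::vector<Node> storage_;
        };

        struct ListNode {
            int value;
            ListNode* next;
            ListNode(int v, ListNode* n) : value(v), next(n) {}
        };

        // The resulting singly-linked list has its nodes laid out back to back.
        ListNode* build_list(NodePool<ListNode>& pool, int count) {
            ListNode* head = nullptr;
            for (int i = count - 1; i >= 0; --i)
                head = pool.allocate(i, head);
            return head;
        }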

    There are also data structures like unrolled lists which are often overlooked since they offer no algorithmic benefits over a linked list. Yet they offer substantially greater benefits from a memory efficiency perspective, and in those scenarios where we have two data structures which have similar algorithmic complexity but vastly different memory layouts, the winner tends to be the one with more efficient memory layout and access patterns. The unrolled list links arrays of elements together rather than individual elements, and, again, spatial locality strongly favors contiguous array-based representations.
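
    A bare-bones sketch of the idea (the block size of 16 is an arbitrary choice):

        #include <cstddef>

        // Unrolled list node: instead of one element per separately allocated node,
        // each node carries a small contiguous block of elements, so traversal does
        // far fewer pointer hops and wastes far less of each fetched cache line.
        template <typename T, std::size_t BlockSize = 16>
        struct UnrolledNode {
            T elements[BlockSize];     // contiguous payload inside the node
            std::size_t count = 0;     // how many slots are in use
            UnrolledNode* next = nullptr;
        };

        // Visits BlockSize elements per pointer hop instead of one.
        template <typename T, std::size_t BlockSize, typename Fn>
        void for_each(UnrolledNode<T, BlockSize>* head, Fn fn) {
            for (auto* node = head; node != nullptr; node = node->next)
                for (std::size_t i = 0; i < node->count; ++i)
                    fn(node->elements[i]);
        }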

    Yet just about any micro-optimization will degrade the straightforwardness and maintainability of your code. So the key to optimization in general is prioritization, and that's where the profiler can at least somewhat help you stay in check (from a productivity perspective, a profiler has the enormous benefit of showing you what not to optimize that you might have otherwise been tempted to attempt).