Tags: performance, memory, memory-management, dynamic-memory-allocation, cpu-cache

Do memory allocation functions indicate that the memory content is no longer used?


When processing some stream of data, e.g., requests from a network, it is quite common for some temporary memory to be used. For example, a URL may be split into multiple strings, each one possibly allocating memory from the heap. These objects are typically short-lived, and the total amount of memory involved is relatively small and should fit into a CPU cache.
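
To make this concrete, here is a minimal sketch in C of the kind of short-lived heap use described above (the function names are illustrative only, not taken from any particular code base):

    /* Illustrative sketch: each URL path segment is copied into its own
     * small, short-lived heap allocation, used briefly, then freed. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void handle_segment(const char *seg) { printf("segment: %s\n", seg); }

    static void process_url(const char *url)
    {
        const char *p = url;
        while (*p) {
            const char *slash = strchr(p, '/');
            size_t len = slash ? (size_t)(slash - p) : strlen(p);
            char *seg = malloc(len + 1);   /* temporary heap allocation */
            if (!seg)
                return;
            memcpy(seg, p, len);
            seg[len] = '\0';
            handle_segment(seg);
            free(seg);                     /* content is now dead, but the CPU
                                              cache has no way of knowing that */
            p = slash ? slash + 1 : p + len;
        }
    }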

At the point the memory used for a temporary string gets released, the string content may very well have only ever lived within the cache. However, the CPU is unaware of the memory being deallocated: the deallocation is just an update in the memory management system. As a result, the CPU may end up unnecessarily writing the dead content back to actual memory when the cache lines are repurposed for other memory - unless the memory release somehow indicates to the CPU that the memory isn't used anymore. Hence, the question becomes:

Do memory management functions that release memory somehow indicate that the content of the corresponding memory can be discarded? Is there even a way to indicate to the CPU that memory is no longer used? (At least for some CPUs: there may, obviously, be differences between architectures.) Since different implementations will likely differ in quality and may or may not do anything fancy, the question really is whether there is any memory management implementation that indicates memory as unused.

I do realize that always using the same arena of memory may be a mitigation strategy to avoid unnecessary writes to the actual memory: in that case the same cached memory would be reused. Similarly, it may well be that the allocator keeps yielding the same memory, also avoiding unnecessary memory transfers. However, ideally I wouldn't need to rely on either of these techniques being applicable.
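
For what it's worth, the arena idea can be sketched like this (a hypothetical helper, not any particular allocator's API): one small buffer is reused for all temporaries of a request, so "freeing" is just resetting an offset and the same cache lines get touched over and over.

    #include <stddef.h>

    typedef struct {
        unsigned char buf[4096];   /* small enough to stay cache-resident */
        size_t        used;
    } arena_t;

    static void *arena_alloc(arena_t *a, size_t n)
    {
        n = (n + 15) & ~(size_t)15;        /* keep 16-byte alignment */
        if (n > sizeof a->buf - a->used)
            return NULL;                   /* caller falls back to malloc() */
        void *p = a->buf + a->used;
        a->used += n;
        return p;
    }

    static void arena_reset(arena_t *a)    /* call once per request/URL */
    {
        a->used = 0;
    }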


Solution

  • No.

    The cache operation you mention (marking cached memory as unused and discarding without writeback to main memory) is called cacheline invalidation without writeback. This is performed through a special instruction with an operand that may (or may not) indicate the address of the cacheline to be invalidated.

    On all architectures I'm familiar with, this instruction is privileged, with good reason in my opinion. This means that usermode code cannot employ the instruction; only the kernel can. The amount of perverted trickery, data loss and denial of service that would be possible otherwise is incredible.

    As a result, no memory allocator could do what you propose; they simply don't have (in usermode) the tools to do so.

    Architectural Support

    • The x86 and x86-64 architectures have the privileged invd instruction, which invalidates all internal caches without writeback and directs external caches to invalidate themselves also. This is the only instruction capable of invalidating without writeback, and it is a blunt weapon indeed.
      • The non-privileged clflush instruction specifies a victim address, but it writes back before invalidating, so I mention it only in passing (a short usage sketch follows after this list).
      • Documentation for all these instructions is in Intel's SDMs, Volume 2.
    • The ARM architecture performs cache invalidation without writeback with a write to coprocessor 15, register 7: MCR p15, 0, <Rd>, c7, <CRm>, <Opcode_2>. A victim cacheline may be specified. Writes to this register are privileged.
    • PowerPC has dcbi, which lets you specify a victim, dci, which doesn't, and instruction-cache versions of both, but all four are privileged (see page 1400).
    • MIPS has the CACHE instruction, which can specify a victim. It was privileged as of MIPS Instruction Set v5.04, but in 6.04 Imagination Technologies muddied the water and it's no longer clear what's privileged and what's not.

    So this excludes the use of cache invalidation without flushing/writing back in usermode outright.
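
    For contrast, here is a minimal sketch of the one cache-control primitive non-privileged x86 code does get: clflush, via the _mm_clflush() intrinsic (SSE2 and a 64-byte line size are assumed). It also illustrates why it doesn't answer the question: each line is written back before it is invalidated, so the memory traffic still happens.

        #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
        #include <stddef.h>
        #include <stdint.h>

        static void flush_buffer(const void *p, size_t len)
        {
            const uintptr_t line = 64;     /* assumed cacheline size */
            uintptr_t addr = (uintptr_t)p & ~(line - 1);
            uintptr_t end  = (uintptr_t)p + len;
            while (addr < end) {
                _mm_clflush((const void *)addr);   /* write back, then invalidate */
                addr += line;
            }
            _mm_mfence();                  /* order the flushes */
        }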

    Kernel mode?

    However, I'd argue that it's still a bad idea in kernelmode for numerous reasons:

    • Linux's allocator, kmalloc(), allocates out of arenas for different sizes of allocations. In particular, it has an arena for each allocation size <=192 bytes in steps of 8; this means that objects can potentially be closer to each other than a cacheline, or partially overlap the next one, and using invalidation could thus blow out nearby objects that were rightly in cache and not yet written back. This is wrong.
      • This problem is compounded by the fact that cache lines may be quite large (on x86-64, 64 bytes), and moreover are not necessarily uniform in size throughout the cache hierarchy. For instance, Pentium 4 had 64B L1 cachelines but 128B L2 cachelines.
    • It makes the deallocation time linear in the number of cachelines of the object to deallocate.
    • It has a very limited benefit; the size of the L1 cache is usually in the KBs, so a few thousand flushes will fully empty it. Moreover, the cache may already have flushed the data without your prompting, so your invalidation is worse than useless: the memory bandwidth was used, but you no longer have the line in cache, so when it is next partially written to it will need to be refetched.
    • The next time the memory allocator returns that block, which might be soon, the user thereof will suffer a guaranteed cache miss and fetch from main RAM, while he could have had a dirty unflushed line or a clean flushed line instead. The cost of a guaranteed cache miss and fetch from main RAM is much greater than a cache line flush without invalidation tucked in somewhere automatically and intelligently by the caching hardware.
    • The additional code required to loop and flush these lines wastes instruction cache space.
    • A better use for the dozens of cycles taken by the aforementioned loop to invalidate cachelines would be to keep doing useful work, while letting the cache and memory subsystem's considerable bandwidth write back your dirty cachelines.
      • My modern Haswell processor has 32 bytes / clock cycle of L1 write bandwidth and 25 GB/s of main RAM bandwidth. I'm sure a couple of extra flushable 64-byte cachelines can be squeezed in somewhere in there.
    • Lastly, for short-lived, small allocations like these, there's the option of allocating them on the stack (see the sketch after this list).
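
    To sketch that last point (illustrative only, with an assumed 256-byte cap on segment length): the temporary buffer from the URL-splitting scenario in the question can simply live on the stack, where the lines near the top of the stack are almost always already hot.

        #include <stdio.h>
        #include <string.h>

        static void process_url_stack(const char *url)
        {
            const char *p = url;
            while (*p) {
                const char *slash = strchr(p, '/');
                size_t len = slash ? (size_t)(slash - p) : strlen(p);
                if (len < 256) {
                    char seg[256];         /* no allocator involved at all */
                    memcpy(seg, p, len);
                    seg[len] = '\0';
                    printf("segment: %s\n", seg);
                }
                p = slash ? slash + 1 : p + len;
            }
        }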

    Actual memory allocator practice

    • The famed dlmalloc does not invalidate freed memory.
    • glibc does not invalidate freed memory.
    • jemalloc does not invalidate freed memory.
    • musl-libc's malloc() does not invalidate freed memory.

    None of them invalidate memory, because they can't. Doing a system call for the sake of invalidating cachelines would both be incredibly slow and cause far more traffic in/out of the cache, just because of the context switch.