When processing a stream of data, e.g., requests from a network, it is quite common to use some temporary memory. For example, a URL may be split into multiple strings, each one possibly allocating memory from the heap. These entities are often short-lived, and the total amount of memory involved is often relatively small and should fit into a CPU cache.
At the point the memory used for a temporary string is released, the string's content may very well have lived only within the cache. However, the CPU is unaware of the memory being deallocated: the deallocation is just an update in the memory-management system. As a result, the CPU may end up writing the unused content to actual memory unnecessarily when the cache is reused for other memory, unless the release somehow indicates to the CPU that the memory isn't used anymore. Hence, the question becomes:
Do memory-management functions that release memory somehow indicate that the content of the corresponding memory can be discarded? Is there even a way to indicate to the CPU that memory is no longer used (at least on some CPUs; there may, obviously, be differences between architectures)? Since different implementations will likely differ in quality and may or may not do anything fancy, the question really is: is there any memory-management implementation that indicates released memory as unused?
I do realize that always using the same arena of memory may be a mitigation strategy to avoid unnecessary writes to actual memory: in that case the same cached memory would be reused. Similarly, the allocator may well yield the same memory each time, also avoiding unnecessary transfers. However, I'd prefer not to rely on either of these techniques being applicable.
The cache operation you mention (marking cached memory as unused and discarding without writeback to main memory) is called cacheline invalidation without writeback. This is performed through a special instruction with an operand that may (or may not) indicate the address of the cacheline to be invalidated.
On all architectures I'm familiar with, this instruction is privileged, with good reason in my opinion. This means that usermode code cannot employ the instruction; only the kernel can. The amount of perverted trickery, data loss and denial of service that would be possible otherwise is incredible.

As a result, no memory allocator could do what you propose; they simply don't have (in usermode) the tools to do so.
- x86 has the `invd` instruction, which invalidates all internal caches without writeback and directs external caches to invalidate themselves also. This is the only x86 instruction capable of invalidating without writeback, and it is a blunt weapon indeed. The `clflush` instruction specifies a victim address, but it writes back before invalidating, so I mention it only in passing.
- On ARM, cache maintenance is done through writes to the CP15 coprocessor register: `MCR p15, 0, <Rd>, c7, <CRm>, <Opcode_2>`. A victim cacheline may be specified. Writes to this register are privileged.
- PowerPC has `dcbi`, which lets you specify a victim, `dci`, which doesn't, and instruction-cache versions of both, but all four are privileged (see page 1400).
- MIPS has the `CACHE` instruction, which can specify a victim. It was privileged as of MIPS Instruction Set v5.04, but in 6.04 Imagination Technologies muddied the water and it's no longer clear what's privileged and what not.

So this excludes the use of cache invalidation without flushing/writing back in usermode outright.
However, I'd argue that it's still a bad idea in kernelmode for numerous reasons:
- The kernel's own allocator, `kmalloc()`, allocates out of arenas for different sizes of allocations. In particular, it has an arena for each allocation size `<= 192` bytes, in steps of `8`. This means that objects can sit closer to each other than a cacheline, or partially overlap the next one, and using invalidation could thus blow out nearby objects that were rightly in cache and not yet written back. This is wrong.
As for whether any existing usermode implementation does this:

- `glibc` does not invalidate freed memory.
- `jemalloc` does not invalidate freed memory.
- `musl`-libc's `malloc()` does not invalidate freed memory.

None of them invalidate memory, because they can't. Doing a system call for the sake of invalidating cachelines would be both incredibly slow and would cause far more traffic in/out of cache, just because of the context switch.