Tags: go, memory, out-of-memory, cgroups

Why does Golang's MADV_FREE sometimes lead to OOM?


We use Go 1.12 and deploy our services with Kubernetes. In our production environment, one project kept hitting OOM until its container was killed. After some research online, we found the cause was Golang's MADV_FREE behavior; once we switched to MADV_DONTNEED, the problem was solved.

According to what we read online, MADV_FREE means the system only reclaims the memory when it is under pressure. But memory allocation happens all the time, and our other services run in the same environment. Why do they not get OOM-killed?


Solution

  • Well, I doubt such a question is a good fit for SO, as it's unlikely to have a short, on-point answer, but still, let me give it a go.

    The first thing to consider is that the in-kernel OOM killer, which is engaged when the kernel finds itself short on memory, merely locates the process with the highest memory consumption¹ and brings it down. (I hope you're talking about the in-kernel OOM killer and not some k8s-specific service or something you've developed in-house.)

    Then let's consider the Go 1.12 release notes, which announced the switch to MADV_FREE:

    On Linux, the runtime now uses MADV_FREE to release unused memory. This is more efficient but may result in higher reported RSS. The kernel will reclaim the unused data when it is needed. To revert to the Go 1.11 behavior (MADV_DONTNEED), set the environment variable GODEBUG=madvdontneed=1.

    (Emphasis mine.)

    What this means is that if, say, a program compiled with Go 1.12 runs for some amount of time under some standard load, and then the same program runs for the same amount of time under the same load but with the GODEBUG=madvdontneed=1 setting, the apparent RSS consumption, as seen from the outside, will be higher in the first case than in the second.
    To reiterate: the number of memory pages actually marked with madvise(2) by the Go memory manager will be roughly the same during both runs, but because the kernel handles pages freed in these two ways differently, the RSS readings will differ. It is not the actual memory usage that differs, only the RSS readings.
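
    If you want to observe this effect yourself, here is a minimal sketch (my own illustration, not part of the original question): it allocates and touches a large buffer, releases it with debug.FreeOSMemory, and prints the VmRSS value from /proc/self/status. Build it with a Go 1.12 toolchain (where MADV_FREE is the default) and run it once as-is and once with GODEBUG=madvdontneed=1; the allocation pattern is identical, but the RSS reported after the release will typically stay high only in the MADV_FREE case. The buffer size is arbitrary.

        package main

        import (
            "bufio"
            "fmt"
            "os"
            "runtime/debug"
            "strings"
        )

        // rss returns the VmRSS line from /proc/self/status (Linux only).
        func rss() string {
            f, err := os.Open("/proc/self/status")
            if err != nil {
                return "unknown"
            }
            defer f.Close()
            sc := bufio.NewScanner(f)
            for sc.Scan() {
                if strings.HasPrefix(sc.Text(), "VmRSS:") {
                    return strings.TrimSpace(strings.TrimPrefix(sc.Text(), "VmRSS:"))
                }
            }
            return "unknown"
        }

        func main() {
            // Allocate ~1 GiB and touch every page so it becomes resident.
            buf := make([]byte, 1<<30)
            for i := range buf {
                buf[i] = 1
            }
            fmt.Println("RSS after allocation:", rss())

            // Drop the reference and force the runtime to return unused
            // memory to the OS -- this is where madvise(2) is issued.
            buf = nil
            debug.FreeOSMemory()
            fmt.Println("RSS after release:   ", rss())
        }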

    This obviously makes a process that returns memory to the OS using MADV_FREE more likely to be picked by the OOM killer.

    Having said that, I'd advise you to look at your problem from a different angle. Measuring the memory consumption of a Go program built with a "stock" Go implementation from the outside is not exactly useless, but it is only good for catching obvious things such as steady memory growth over multiple GC cycles, which possibly indicates a memory leak. To actually assess the real memory usage pattern, you have to use the metrics provided by the Go runtime of the running program. I have tried to detail the reasons for this here.
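
    As a rough illustration of what "metrics provided by the Go runtime" means in practice, a sketch like the one below (the names and the 10-second interval are my own choice) can be run inside a service; HeapAlloc reflects live heap objects, while HeapReleased counts memory already handed back to the OS, which may still show up in externally observed RSS when MADV_FREE is in effect. In a real service you would typically expose these values via expvar or a metrics endpoint instead of printing them.

        package main

        import (
            "fmt"
            "runtime"
            "time"
        )

        func main() {
            // Periodically dump the runtime's own view of memory usage.
            for {
                var m runtime.MemStats
                runtime.ReadMemStats(&m)
                fmt.Printf("HeapAlloc=%d MiB  HeapSys=%d MiB  HeapReleased=%d MiB  Sys=%d MiB\n",
                    m.HeapAlloc>>20, m.HeapSys>>20, m.HeapReleased>>20, m.Sys>>20)
                time.Sleep(10 * time.Second)
            }
        }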

    So I would say you'd better concentrate on tuning the OOM killer settings or something along those lines (note that it's possible to make a particular process immune to the OOM killer).
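
    For completeness, one way to make a process immune to the OOM killer on Linux is to write -1000 to its oom_score_adj file. Below is a minimal sketch, assuming the process has the privileges required to lower its own score (root or CAP_SYS_RESOURCE); in Kubernetes the score is normally driven by the pod's QoS class rather than by the application itself, so treat this purely as an illustration of the kernel mechanism.

        package main

        import (
            "fmt"
            "os"
        )

        func main() {
            // A score of -1000 tells the kernel OOM killer to never select
            // this process; lowering the score requires elevated privileges.
            err := os.WriteFile("/proc/self/oom_score_adj", []byte("-1000"), 0644)
            if err != nil {
                fmt.Fprintln(os.Stderr, "cannot adjust OOM score:", err)
                return
            }
            fmt.Println("this process is now exempt from the OOM killer")
        }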

    Also note that you're using a dirt-old, unsupported version of Go, and that in Go 1.16 the madvise behavior was reverted once again, so it uses MADV_DONTNEED by default. Maybe it's a good time to upgrade.


    ¹ It's actually more complex than that as the OOM killer has a set of heuristics which it uses to find "resource hogs", and the memory consumption is just one of the metrics it considers.