Tags: memory-leaks, garbage-collection, v8, embedded-v8

V8 Memory leak when using optional chaining in script


I've embedded V8 9.5 into my app (a C++ HTTP server). When I started to use optional chaining in my JS scripts, I noticed an abnormal rise in memory consumption under heavy (CPU) load, leading to OOM. While there's some free CPU, memory usage is normal. I've displayed the V8 HeapStats in Grafana (this is only for 1 isolate, of which I have 8 in my app); see the heap stats graph.
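
For reference, peak_malloced_memory and the other stats above come from v8::HeapStatistics, filled in by v8::Isolate::GetHeapStatistics(); a minimal sketch of the collection (printing is illustrative, the real app exports the values to Grafana):

#include <cstdio>
#include <v8.h>

// Samples the per-isolate heap statistics that are plotted above.
void ReportHeapStats(v8::Isolate* isolate) {
  v8::HeapStatistics stats;
  isolate->GetHeapStatistics(&stats);
  std::printf("total_heap_size=%zu used_heap_size=%zu "
              "malloced_memory=%zu peak_malloced_memory=%zu\n",
              stats.total_heap_size(), stats.used_heap_size(),
              stats.malloced_memory(), stats.peak_malloced_memory());
}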

Under heavy load there's a spike in peak_malloced_memory, while other stats are much less affected and seem normal. I passed the --expose-gc flag to V8 and called gc() at the end of my script. That completely solved the problem, and peak_malloced_memory doesn't rise like that anymore. Also, by repeatedly calling gc() I could free all the extra memory consumed without it. --gc-global also works. But these approaches seem more like a workaround than a production-ready solution. --max-heap-size=64 and --max-old-space-size=64 had no effect: memory consumption still greatly exceeded 8 (the number of isolates in my app) * 64 MB (> 2 GB of physical RAM).
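
In the embedder, flags like these can be set programmatically before V8 is initialized (a sketch; the exact flag string is whatever combination is being tested):

#include <v8.h>

// Must run before v8::V8::Initialize() / before any isolate is created.
// --expose-gc makes the global gc() function available to scripts; the two
// heap-size flags only bound the GC-managed heap, not malloc'ed memory.
void ConfigureV8Flags() {
  v8::V8::SetFlagsFromString(
      "--expose-gc --max-heap-size=64 --max-old-space-size=64");
}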

I don't use any GC-related V8 API in my app.

My app creates v8::Isolate and v8::Context once and uses them to process HTTP requests.
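
A minimal sketch of that setup, assuming single-threaded use of each isolate (platform/V8 initialization, error handling, and the HTTP layer are omitted; JsEngine and RunScriptForRequest are illustrative names):

#include <v8.h>

// One isolate + one context, created at startup and reused for every request.
struct JsEngine {
  v8::Isolate* isolate = nullptr;
  v8::Global<v8::Context> context;  // persistent handle, outlives HandleScopes
};

JsEngine CreateEngine() {
  // Assumes v8::V8::InitializePlatform() and v8::V8::Initialize() already ran.
  v8::Isolate::CreateParams params;
  params.array_buffer_allocator =
      v8::ArrayBuffer::Allocator::NewDefaultAllocator();

  JsEngine engine;
  engine.isolate = v8::Isolate::New(params);

  v8::Isolate::Scope isolate_scope(engine.isolate);
  v8::HandleScope handle_scope(engine.isolate);
  engine.context.Reset(engine.isolate, v8::Context::New(engine.isolate));
  return engine;
}

// Called for each HTTP request with the JS source to run.
void RunScriptForRequest(JsEngine& engine, const char* source) {
  v8::Isolate::Scope isolate_scope(engine.isolate);
  v8::HandleScope handle_scope(engine.isolate);
  v8::Local<v8::Context> context = engine.context.Get(engine.isolate);
  v8::Context::Scope context_scope(context);

  v8::Local<v8::String> src =
      v8::String::NewFromUtf8(engine.isolate, source).ToLocalChecked();
  v8::Local<v8::Script> script =
      v8::Script::Compile(context, src).ToLocalChecked();
  script->Run(context).ToLocalChecked();
}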

Same behavior on v9.7.

Ubuntu Xenial

Built V8 with these args.gn:

dcheck_always_on = false
is_debug = false
target_cpu = "x64"
v8_static_library = true
v8_monolithic = true
v8_enable_webassembly = true
v8_enable_pointer_compression = true
v8_enable_i18n_support = false
v8_use_external_startup_data = false
use_thin_lto = true
thin_lto_enable_optimizations = true
x64_arch = "sandybridge"
use_custom_libcxx = false
use_sysroot = false
treat_warnings_as_errors = false # due to use_custom_libcxx = false
use_rtti = true # for sanitizers

And then I manually turned the static library into a dynamic one with this command (I had some linking issues with the static lib due to LTO that I didn't want to deal with in the future):

../../../third_party/llvm-build/Release+Asserts/bin/clang++ -shared -o libv8_monolith.so -Wl,--whole-archive libv8_monolith.a -Wl,--no-whole-archive -flto=thin -fuse-ld="lld"

I did some load testing (since the problem only occurs under load) with and without the manual gc() call, and this is what the RAM usage graph showed during load testing, with timestamps:

  1. Started load testing with gc() call: no "leak"
  2. Removed gc() call and started another load testing session: "leak"
  3. Brought back manual gc() call under low load: memory usage started to gradually decrease.
  4. Started another load testing session (with gc() still in script): memory usage quickly decreased to baseline values.

My questions are:

  1. Is it normal that peak_malloced_memory can exceed total_heap_size?
  2. Why could this occur only when using JS's optional chaining?
  3. Are there any other, more correct solutions to this problem other than forcing full GC all the time?

Solution

  • I think I got to the bottom of this...

    Turns out, this was caused by V8's --concurrent-recompilation feature in conjunction with our jemalloc configuration.

    Looks like when using optional chaining instead of a hand-written function, V8 tries more aggressively to optimize the code concurrently and allocates far more memory for that (zone-stats showed > 70 MB of memory per isolate). And it does that specifically under high load (maybe only then does V8 notice hot functions).

    jemalloc, in turn, by default has 128 arenas and background_thread disabled. Because with concurrent recompilation the optimization is done on a separate thread, V8's TurboFan optimizer ended up allocating a lot of memory in a separate jemalloc arena, and even though V8 freed this memory, because of jemalloc's decay strategy and because that arena wasn't accessed anywhere else, the pages weren't purged, thus increasing resident memory.

    Jemalloc stats:
    Before memory runaway:

    Allocated: 370110496, active: 392454144, metadata: 14663632 (n_thp 0), resident: 442957824, mapped: 570470400, retained: 240078848
    

    After memory runaway:

    Allocated: 392623440, active: 419590144, metadata: 22934240 (n_thp 0), resident: 1712504832, mapped: 1840152576, retained: 523337728
    

    As you can see, while allocated memory is less than 400 MB, RSS is at 1.7 GB due to ~300,000 dirty pages (~1.1 GB). And all those dirty pages are spread across a handful of arenas with one associated thread (the one on which V8's TurboFan optimizer did the concurrent recompilation).
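
    For reference, these counters come from jemalloc's mallctl interface; a rough sketch of how a snapshot like the above can be taken (assuming an unprefixed jemalloc build):

    #include <cstdint>
    #include <cstdio>
    #include <jemalloc/jemalloc.h>

    // Prints the same counters quoted above. The "epoch" update is required,
    // otherwise jemalloc returns stale statistics.
    void PrintJemallocStats() {
      uint64_t epoch = 1;
      size_t epoch_sz = sizeof(epoch);
      mallctl("epoch", &epoch, &epoch_sz, &epoch, sizeof(epoch));

      const char* names[] = {"stats.allocated", "stats.active",
                             "stats.metadata",  "stats.resident",
                             "stats.mapped",    "stats.retained"};
      for (const char* name : names) {
        size_t value = 0;
        size_t len = sizeof(value);
        mallctl(name, &value, &len, nullptr, 0);
        std::printf("%s: %zu\n", name, value);
      }
    }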

    --no-concurrent-recompilation solved the issue, and I think it is optimal in our use case, where we allocate an isolate for each CPU core and distribute the load evenly, so there's little point in performing recompilation concurrently from a bandwidth standpoint.
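
    In the embedder, this flag can be applied the same way as any other V8 flag, before V8 is initialized (a sketch):

    // Disables TurboFan's background recompilation; must run before
    // v8::V8::Initialize() / before any isolate is created.
    v8::V8::SetFlagsFromString("--no-concurrent-recompilation");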

    This can also be solved on jemalloc's side with MALLOC_CONF="background_thread:true" (which, allegedly, can crash) or by reducing the number of arenas with MALLOC_CONF="percpu_arena:percpu" (which may increase contention). MALLOC_CONF="dirty_decay_ms:0" also fixed the issue, but it is a suboptimal solution.
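
    If you prefer to bake one of these settings into the binary instead of the environment, jemalloc also reads an application-provided malloc_conf symbol (a sketch):

    // Equivalent to launching with MALLOC_CONF="background_thread:true";
    // jemalloc picks up this global (with C linkage) at startup.
    extern "C" const char* malloc_conf = "background_thread:true";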

    Not sure how forcing GC helped to regain memory, maybe it somehow triggered access to those jemalloc arenas without allocating much memory in them.