
Elements incorrectly evicted from eBPF LRU hash map


I observe that elements are incorrectly evicted from an eBPF LRU hash map (BPF_MAP_TYPE_LRU_HASH). In the following code, I insert into an LRU hash map with max_entries of 8 and print its contents every second:

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/cilium/ebpf"
)

func main() {
    // LRU hash map with 4-byte keys, 8-byte values and room for 8 entries.
    spec := ebpf.MapSpec{
        Name:       "test_map",
        Type:       ebpf.LRUHash,
        KeySize:    4,
        ValueSize:  8,
        MaxEntries: 8,
    }

    hashMap, err := ebpf.NewMap(&spec)
    if err != nil {
        log.Fatalln("Could not create map:", err)
    }

    var insertKey uint32

    for range time.Tick(time.Second) {
        // Insert one new element per second.
        err = hashMap.Update(insertKey, uint64(insertKey), ebpf.UpdateAny)
        if err != nil {
            log.Printf("Update failed. insertKey=%d|value=%d|err=%s", insertKey, insertKey, err)
        }

        // Dump the whole map to see which elements are still present.
        var key uint32
        var value uint64
        count := 0
        elementsStr := ""

        iter := hashMap.Iterate()

        for iter.Next(&key, &value) {
            elementsStr += fmt.Sprintf("(%d, %d) ", key, value)
            count++
        }

        log.Printf("Total elements: %d, elements: %s", count, elementsStr)

        insertKey++
    }
}

When I run the above program, I see this:

2023/03/29 17:32:29 Total elements: 1, elements: (0, 0) 
2023/03/29 17:32:30 Total elements: 2, elements: (1, 1) (0, 0) 
2023/03/29 17:32:31 Total elements: 3, elements: (1, 1) (0, 0) (2, 2) 
2023/03/29 17:32:32 Total elements: 3, elements: (3, 3) (0, 0) (2, 2) 
...

Since the map has room for eight entries, I would expect the fourth line to show four elements, but it shows only three because entry (1, 1) was evicted.

If I change max_entries to 1024, I notice this problem happens after inserting the 200th element, though sometimes it happens later than that; it's not consistent.

This issue is not limited to creating and updating the map from user space: I observe the same problem with my XDP program, which creates the map and inserts into it; the program above reproduces the issue I see there. In that real program, which also has 1024 entries, the problem appeared after inserting the 16th element.

I tested this on our production servers that run Linux kernel 5.16.7.

I do my testing on a Linux VM, and after upgrading its kernel to 6.2.8, I observe that the eviction behavior is different. For example, when max_entries is 8, I see this:

2023/03/29 20:38:02 Total elements: 1, elements: (0, 0)
2023/03/29 20:38:03 Total elements: 2, elements: (0, 0) (1, 1)
2023/03/29 20:38:04 Total elements: 3, elements: (0, 0) (2, 2) (1, 1)
2023/03/29 20:38:05 Total elements: 4, elements: (0, 0) (2, 2) (1, 1) (3, 3)
2023/03/29 20:38:06 Total elements: 5, elements: (4, 4) (0, 0) (2, 2) (1, 1) (3, 3)
2023/03/29 20:38:07 Total elements: 6, elements: (4, 4) (0, 0) (2, 2) (1, 1) (5, 5) (3, 3)
2023/03/29 20:38:08 Total elements: 7, elements: (4, 4) (0, 0) (2, 2) (1, 1) (6, 6) (5, 5) (3, 3)
2023/03/29 20:38:09 Total elements: 8, elements: (7, 7) (4, 4) (0, 0) (2, 2) (1, 1) (6, 6) (5, 5) (3, 3)
2023/03/29 20:38:10 Total elements: 1, elements: (8, 8)
...

When max_entries is 1024, I notice that after the 1025th element is added, only 897 elements remain. I can't test with kernel 6.2.8 on our production servers.


Solution

  • The LRU hash map doesn't guarantee that it holds exactly the maximum number of items, and the implementation is clearly geared towards providing good performance with far more than 8 items. What I see from a fairly quick glance at the code:

    1. The LRU is separated into two parts, an "active list", and an "inactive list", with a task that moves elements from one to the other periodically depending on whether or not they've been accessed recently. It's not a true LRU (items don't get moved to the head every single time they're accessed).

    2. When the map is full, and something needs to be evicted in order to insert a new item, the code will evict up to 128 items from the inactive list in a single pass; only if the inactive list is empty does it evict a single item from the active list. (A short sketch below makes the arithmetic concrete.)

    3. There is also a per-CPU "local free list" of allocated items waiting to be filled with data; when that runs empty, it attempts to pull from the global free list, and if that is empty it goes to the eviction path. The target size of the local free list is 4 items.

    So the behavior in 6.2.8 seems straightforward and consistent: presumably all of your keys are on the "inactive list" (not too surprising for a scanning-type access pattern, or perhaps it's just that none of them had a chance to get promoted yet), and all of them get tossed out. I'm less clear about 5.16, but it probably relates to the local free list and all of the updates running from the same CPU.
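
    A toy model may make the numbers concrete. The sketch below is not the kernel code; it just plays out the arithmetic implied by point 2 above, under the assumption that every element of the full map is sitting on the inactive list when the overflowing insert happens (the batch size of 128 is the only value taken from the description, the rest is a deliberate simplification):

    package main

    import "fmt"

    // survivorsAfterOverflow models a single overflow of a full LRU map as
    // described above: up to 128 elements are freed from an entirely inactive
    // map in one pass, and then the new element is inserted.
    func survivorsAfterOverflow(maxEntries int) int {
        const evictionBatch = 128 // "up to 128 items ... in a single pass"

        evicted := evictionBatch
        if evicted > maxEntries {
            evicted = maxEntries // a small map can lose everything at once
        }
        return maxEntries - evicted + 1
    }

    func main() {
        for _, size := range []int{8, 1024} {
            fmt.Printf("max_entries=%d: inserting element %d leaves %d elements\n",
                size, size+1, survivorsAfterOverflow(size))
        }
    }

    For max_entries=8 this predicts a single remaining element after the ninth insert, and for 1024 it predicts 1024 - 128 + 1 = 897, which matches both of the 6.2.8 observations above.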

    Basically I think that the data type wasn't meant to be used in the way you're using it, and the bug is in your expectations. If you disagree, I think you'll have to take it up with the kernel developers.
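
    If you want to poke at the 5.16 hypothesis (all of the updates draining the same per-CPU local free list) before raising it with the kernel developers, one way is to pin the updating thread to a single CPU and compare the surviving element count with and without the pinning. The sketch below is a variation of the reproduction program from the question; the CPU pinning via runtime.LockOSThread and golang.org/x/sys/unix is an addition of mine, not something the original program does:

    package main

    import (
        "log"
        "runtime"

        "github.com/cilium/ebpf"
        "golang.org/x/sys/unix"
    )

    func main() {
        hashMap, err := ebpf.NewMap(&ebpf.MapSpec{
            Name:       "test_map",
            Type:       ebpf.LRUHash,
            KeySize:    4,
            ValueSize:  8,
            MaxEntries: 1024,
        })
        if err != nil {
            log.Fatalln("Could not create map:", err)
        }

        // Pin this goroutine's OS thread to CPU 0 so that every update goes
        // through the same per-CPU local free list. Drop these lines, or run
        // the updates from several differently pinned goroutines, to compare.
        runtime.LockOSThread()
        var cpus unix.CPUSet
        cpus.Zero()
        cpus.Set(0)
        if err := unix.SchedSetaffinity(0, &cpus); err != nil {
            log.Fatalln("Could not set CPU affinity:", err)
        }

        // Insert twice as many elements as the map can hold.
        for i := uint32(0); i < 2048; i++ {
            if err := hashMap.Update(i, uint64(i), ebpf.UpdateAny); err != nil {
                log.Printf("Update failed. key=%d|err=%s", i, err)
            }
        }

        // Count what survived.
        var (
            key   uint32
            value uint64
            count int
        )
        iter := hashMap.Iterate()
        for iter.Next(&key, &value) {
            count++
        }
        if err := iter.Err(); err != nil {
            log.Fatalln("Iteration failed:", err)
        }
        log.Printf("Elements remaining after 2048 inserts into a 1024-entry LRU map: %d", count)
    }

    More generally, a practical reading of the answer is to give max_entries enough headroom over your expected working set that losing a batch of up to 128 elements on overflow is acceptable, or to switch to a plain Hash map with explicit deletion if you need exact capacity semantics.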