network-programming kernel ebpf

eBPF map concurrent update and delete in a loop


My target scenario is flow logging: I want to implement it with an eBPF program and attach that program to different network interfaces using tc.

The eBPF map will look like this: the key is a tuple of srcIp, srcPort, dstIp, dstPort, and protocol. The value is a struct of bytesTransmitted and packetsTransmitted.

Each time a network interface sees a packet, I will update the map to increment the packet and byte counters.

However, the eBPF map has a maximum size, so I will need to periodically clean up stale entries in that map; otherwise the map will fill up pretty quickly.

So I plan to implement a userspace program that will periodically expire the flow map entries.

However, what are the implications of accessing that map concurrently from both the userspace program and the kernel? Should I use bpf_spin_lock? I am worried that acquiring a lock on each packet can be expensive.

I also found a post, https://justin.azoff.dev/blog/bpf_map_get_next_key-pitfalls/, whose author uses an unusual way to iterate over and delete map entries. However, I also found this in the kernel source tree, which just deletes items in a while loop: https://elixir.bootlin.com/linux/latest/source/samples/bpf/trace_event_user.c#L108 Who is right and who is wrong? (I guess I should trust the kernel source tree more.)

Even the example above only indicates that I can delete map items while iterating over the map. The post does not say whether I can concurrently update map elements from the kernel.

I would really appreciate any advice.


Solution

  • However, the eBPF map has a maximum size, so I will need to periodically clean up stale entries in that map; otherwise the map will fill up pretty quickly.

    So I plan to implement a userspace program that will periodically expire the flow map entries.

    Another approach might be to use an LRU map, which evicts the least recently used entries on its own when it fills up.

    However, what are the implications of accessing that map concurrently from both the userspace program and the kernel? Should I use bpf_spin_lock? I am worried that acquiring a lock on each packet can be expensive.

    It depends on your goals. You are not required to use synchronization primitives when reading from user space. The syscall copies the map value into the user-supplied buffer, and that copy can no longer change afterwards. However, the following can happen:

    Let's say the map value has two fields, field1 and field2, and the eBPF program increments both (field1++; field2++;). The copy in the syscall can happen between these two modifications, so you can end up with a value that combines the map state from before and after the eBPF program ran. For most applications, such as statistics, this isn't an issue. But if changes within the map value need to be atomic, you will have to use bpf_spin_lock, which is not ideal for performance if you expect many concurrent updates to the same map values.
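
    The interleaving described above can be replayed in plain userspace C. This is only a single-threaded sketch of one unlucky ordering, not real eBPF code: struct flow_stats, snapshot() and torn_read_example() are hypothetical names, and the memcpy merely stands in for the copy the lookup syscall makes.

    ```c
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical value layout, named after the counters in the question. */
    struct flow_stats {
        uint64_t packets;
        uint64_t bytes;
    };

    /* Stand-in for the copy a lookup syscall makes into the caller's buffer. */
    static void snapshot(const struct flow_stats *map_value, struct flow_stats *out)
    {
        memcpy(out, map_value, sizeof(*out));
    }

    /*
     * Replays one unlucky ordering: the "eBPF program" has bumped packets
     * but not yet bytes when the userspace copy happens.
     */
    static struct flow_stats torn_read_example(void)
    {
        struct flow_stats value = { .packets = 10, .bytes = 1000 };
        struct flow_stats seen;

        value.packets += 1;        /* first half of the per-packet update */
        snapshot(&value, &seen);   /* userspace lookup lands here */
        value.bytes += 100;        /* second half of the per-packet update */

        return seen;               /* packets is already new, bytes is still old */
    }
    ```

    The snapshot is off by one packet's worth of bytes, and the next read would catch up, which is why this is usually acceptable for statistics.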

    For most stats-related use cases, atomic operations on the eBPF side, or per-CPU maps, are enough to keep an accurate count. Then collect the counters periodically from userspace.
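
    With a per-CPU map, a userspace lookup fills an array with one value per possible CPU, and the collector adds the slots together. A minimal sketch of that summing step follows; struct flow_stats and sum_percpu() are hypothetical names, and the array is assumed to have been filled beforehand (with libbpf, by bpf_map_lookup_elem() into a buffer sized with libbpf_num_possible_cpus()).

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical value layout, named after the counters in the question. */
    struct flow_stats {
        uint64_t packets;
        uint64_t bytes;
    };

    /* Adds up one slot per possible CPU into a single total for the flow. */
    static struct flow_stats sum_percpu(const struct flow_stats *per_cpu, size_t ncpus)
    {
        struct flow_stats total = { 0, 0 };

        for (size_t i = 0; i < ncpus; i++) {
            total.packets += per_cpu[i].packets;
            total.bytes += per_cpu[i].bytes;
        }
        return total;
    }
    ```

    Because each CPU only writes its own slot, the eBPF side can use plain increments; the trade-off is that the map uses memory for every possible CPU per entry.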

    Who is right and who is wrong?

    Both are right. The first example wants to delete only part of the map and observes that once a key has been deleted, it can no longer be used to get the key after it, so you need to watch the order of operations to avoid restarting the loop and doing extra work. The second example simply removes all values and does not care if iteration restarts from the beginning.

    So it's fine to delete while iterating, but if you only want a partial deletion, you have to watch the order of operations to avoid extra work.

    Even the example above only indicates that I can delete map items while iterating over the map. The post does not say whether I can concurrently update map elements from the kernel.

    It depends on what you mean by "can". Your computer will not explode or crash; it's allowed. Unfortunately, there is no synchronization mechanism that spans multiple syscalls, so you can't read a map value, modify it in userspace, and write it back without the eBPF program being able to modify it in the meantime. Any modifications made by eBPF between the read and update syscalls will be lost. The only way I know of to do something like this is to use map-in-maps to swap out full maps atomically, which brings challenges of its own.
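
    The order of operations for a partial deletion can be sketched against a toy model. Everything here is hypothetical: struct toy_map and the toy_* functions only stand in for a real map, bpf_map_get_next_key() and bpf_map_delete_elem(), and they model just one property, namely that asking for the key after a missing key starts over from the first key. A real hash map's iteration order is also not guaranteed to stay stable while the kernel inserts entries.

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define TOY_SLOTS 8

    /* Toy stand-in for a BPF hash map; NOT the kernel data structure. */
    struct toy_map {
        bool     used[TOY_SLOTS];
        uint32_t key[TOY_SLOTS];
        uint64_t last_seen[TOY_SLOTS];
        unsigned next_key_calls;   /* how many iteration steps were issued */
    };

    /* Returns the slot holding `key`, or -1 if the key is absent. */
    static int toy_find(const struct toy_map *m, uint32_t key)
    {
        for (int i = 0; i < TOY_SLOTS; i++)
            if (m->used[i] && m->key[i] == key)
                return i;
        return -1;
    }

    /* Stores a flow in the first free slot; returns -1 when the map is full. */
    static int toy_insert(struct toy_map *m, uint32_t key, uint64_t last_seen)
    {
        for (int i = 0; i < TOY_SLOTS; i++) {
            if (!m->used[i]) {
                m->used[i] = true;
                m->key[i] = key;
                m->last_seen[i] = last_seen;
                return 0;
            }
        }
        return -1;
    }

    /* A NULL or missing `cur` yields the first key again, as in the linked post. */
    static int toy_get_next_key(struct toy_map *m, const uint32_t *cur, uint32_t *next)
    {
        int at = cur ? toy_find(m, *cur) : -1;

        m->next_key_calls++;
        for (int i = at + 1; i < TOY_SLOTS; i++) {
            if (m->used[i]) {
                *next = m->key[i];
                return 0;
            }
        }
        return -1;   /* no key after `cur` */
    }

    /* Deletes flows last seen before `cutoff`; returns how many were removed. */
    static unsigned expire_stale(struct toy_map *m, uint64_t cutoff)
    {
        uint32_t cur, next = 0;
        unsigned deleted = 0;
        bool more = toy_get_next_key(m, NULL, &cur) == 0;

        while (more) {
            /* Fetch the successor while `cur` is still present in the map. */
            more = toy_get_next_key(m, &cur, &next) == 0;

            int slot = toy_find(m, cur);
            if (slot >= 0 && m->last_seen[slot] < cutoff) {
                m->used[slot] = false;   /* stands in for a delete syscall */
                deleted++;
            }
            cur = next;
        }
        return deleted;
    }
    ```

    With five entries, this issues six next-key calls in total (one to start, one per entry). Deleting first and asking for the next key afterwards would hand a missing key to the lookup and restart from the first entry after every deletion.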