
How do I initialize/reinitialize BPF_MAP_TYPE_PERCPU_HASH entry to zero for all CPUs?


I am writing an XDP BPF program that counts the number of bytes in/out and packets in/out for certain IPs, and I am investigating using a BPF_MAP_TYPE_PERCPU_HASH for this. I have simplified my question a lot, so my use case is not exactly as stated here, but it is close enough.

The map is defined as below

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __type(key, in_addr_t);
    __type(value, ip_stats_t);
    __uint(pinning, LIBBPF_PIN_BY_NAME);
    __uint(max_entries, 64);
} ip_stats_map SEC(".maps");

ip_stats_t is defined as

struct ip_stats {
    __u64 total_bytes_in;
    __u64 total_bytes_out;
    __u64 total_pkts_in;
    __u64 total_pkts_out;
};
typedef struct ip_stats ip_stats_t;

The user mode adds an IP to the ip_stats_map map. The XDP code checks whether an IP is in ip_stats_map and, if there is an entry, increments the necessary fields in ip_stats_t.
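In simplified form, the XDP-side lookup and increment can be sketched like this (names and parsing details are illustrative; the point is that with a per-CPU map each CPU gets its own private copy of the value, so no atomics are needed):

```c
/* Sketch: assumes the ip_stats_map and ip_stats_t definitions above,
 * plus the usual headers (linux/bpf.h, linux/if_ether.h, linux/ip.h,
 * bpf/bpf_helpers.h, bpf/bpf_endian.h). */
SEC("xdp")
int count_ip_stats(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph;
    ip_stats_t *stats;

    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;
    iph = (struct iphdr *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;

    /* Per-CPU map: this lookup returns this CPU's private copy. */
    stats = bpf_map_lookup_elem(&ip_stats_map, &iph->saddr);
    if (stats) {
        stats->total_pkts_in++;
        stats->total_bytes_in += data_end - data;
    }
    return XDP_PASS;
}
```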

I know that when the user mode uses bpf_map_update_elem to update the map, it will only update for the CPU the user mode is running on. I tested that this is how it behaves. See map dump below where I initialized the stats structure value for one test IP to all 1s, and only one CPU was updated.

My questions are:

  1. Is there any way to update the values for all CPUs?
  2. Assuming this is not possible, how do people handle this kind of use case when using BPF_MAP_TYPE_PERCPU_HASH? I could hack the eBPF code to initialize a CPU's record on first access, but what is the right way to do something like this?
  3. I could possibly use a non-per-CPU hash map and use __sync_fetch_and_add (?) to increment the stats, but I suspect the performance would be worse than using per-CPU updates.

Dump of a map that was updated from user mode during my tests: the user mode added an entry for IP 1.1.1.1, and the stats were initialized to all 1s for testing. As can be seen, only one CPU was updated (which is what I expected). I was hoping the other CPUs' values would be zeroed out, but that is not the case.

[{
        "key": 16843009,
        "values": [{
                "cpu": 0,
                "value": {
                    "total_bytes_in": 1,
                    "inner_bytes_in": 1,
                    "total_bytes_out": 1,
                    "inner_bytes_out": 1
                }
            },{
                "cpu": 1,
                "value": {
                    "total_bytes_in": 14850794016,
                    "inner_bytes_in": 140511820856688,
                    "total_bytes_out": 140736605830592,
                    "inner_bytes_out": 140511820905874
                }
            },{
                "cpu": 2,
                "value": {
                    "total_bytes_in": 1,
                    "inner_bytes_in": 20393728,
                    "total_bytes_out": 41,
                    "inner_bytes_out": 41
                }
            },{
                "cpu": 3,
                "value": {
                    "total_bytes_in": 0,
                    "inner_bytes_in": 761108332907331584,
                    "total_bytes_out": 2251795518727952,
                    "inner_bytes_out": 7549274428861842685
                }
            }
        ]
    }
]

My question is similar to this one - bpf_map_update_elem() not updating all CPUs for BPF_MAP_TYPE_LRU_PERCPU_HASH - but that thread did not completely answer my question.


Solution

  • I know that when the user mode uses bpf_map_update_elem to update the map, it will only update for the CPU the user mode is running on. I tested that this is how it behaves.

    Uhm, actually, that is not how bpf_map_update_elem works, and it is also not what is happening here. The bpf_map_update_elem and bpf_map_lookup_elem userspace functions expect the value pointer to point to an array of values, with one element per possible CPU.

    This is part of the description of bpf_map__update_elem from the Libbpf API:

    value – pointer to memory containing bytes of the value

    value_sz – size in byte of value data memory; it has to match BPF map definition’s value_size. For per-CPU BPF maps value size has to be a product of BPF map value size and number of possible CPUs in the system (could be fetched with libbpf_num_possible_cpus()). Note also that for per-CPU values value size has to be aligned up to closest 8 bytes for alignment reasons, so expected size is: round_up(value_size, 8)

    Libbpf provides a special helper, libbpf_num_possible_cpus(), to query the correct number of CPUs. Example usage:

    int ncpus = libbpf_num_possible_cpus();
    if (ncpus < 0) {
        // error handling
    }
    long values[ncpus];
    bpf_map_lookup_elem(per_cpu_map_fd, &key, values);
    

    See map dump below where I initialized the stats structure value for one test IP to all 1s, and only one CPU was updated.

    The other values seem to be scrambled/random, but what actually happened is that libbpf read more data from the provided pointer than you expected. So for the remaining CPUs it took memory from your stack (likely surrounding local variables, return values, or return instruction pointers) and wrote them to the map as values.

    Is there any way to update the values for all CPUs?

    So, yes. In fact, from user space that is the only option: bpf_map_update_elem writes a value for every possible CPU in one call, and there is no way to update the value for just a single CPU.
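    That also answers the zero-initialization question: allocate one zeroed slot per possible CPU and pass the whole buffer in a single update. A sketch (the function name is mine, and it assumes the ip_stats_t definition from the question):

```c
#include <errno.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Insert `ip` with all counters zeroed on every possible CPU: slot i of
 * the buffer becomes CPU i's value. sizeof(ip_stats_t) (4 x u64) is
 * already a multiple of 8, so the round_up(value_size, 8) rule quoted
 * above is satisfied without extra padding. */
int add_ip_zeroed(int map_fd, in_addr_t ip)
{
    int ncpus = libbpf_num_possible_cpus();
    if (ncpus < 0)
        return ncpus;

    ip_stats_t *values = calloc(ncpus, sizeof(*values)); /* zero-filled */
    if (!values)
        return -ENOMEM;

    int err = bpf_map_update_elem(map_fd, &ip, values, BPF_NOEXIST);
    free(values);
    return err;
}
```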

    I can possibly use a non PERCPU hash map, and use __sync_fetch_and_add (?) to increment the stats, but I am thinking that the performance might be worse than using PERCPU updates.

    Correct. Using __sync_fetch_and_add (an atomic add) is slower than per-CPU maps. That is because the CPU still has to synchronize the memory access, just in hardware, so contention still slows you down, and every update invalidates the other cores' caches.

    PERCPU maps use a lot more memory, but have the best performance for metrics, so it is a CPU-memory tradeoff. And you of course have to combine the results in user space.