Difference between BPF_PROG_TYPE_SOCK_OPS and BPF_PROG_TYPE_CGROUP_SOCK


The BPF_PROG_TYPE_SOCK_OPS and BPF_PROG_TYPE_CGROUP_SOCK program types seem to be very similar. According to the kernel source, the two program types are defined as follows:

BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock,
          struct bpf_sock, struct sock)

BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops,
          struct bpf_sock_ops, struct bpf_sock_ops_kern)

Is CGROUP_SOCK a subset of the SOCK_OPS program type? I ask because its context struct, bpf_sock, seems to share several fields with bpf_sock_ops.

Edit: While testing, I also realized that the bpf_sock struct only allows restricted access to the source and destination IPs. Does this reinforce that CGROUP_SOCK is a subset of the SOCK_OPS program type?


Solution

  • This is the commit that introduced the BPF_PROG_TYPE_SOCK_OPS program type: https://github.com/torvalds/linux/commit/40304b2a1567fecc321f640ee4239556dd0f3ee0

    The following is from the commit message which outlines the differences quite nicely:

    Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding struct that allows BPF programs of this type to access some of the socket's fields (such as IP addresses, ports, etc.). It uses the existing bpf cgroups infrastructure so the programs can be attached per cgroup with full inheritance support. The program will be called at appropriate times to set relevant connections parameters such as buffer sizes, SYN and SYN-ACK RTOs, etc., based on connection information such as IP addresses, port numbers, etc.

    Although there are already 3 mechanisms to set parameters (sysctls, route metrics and setsockopts), this new mechanism provides some distinct advantages. Unlike sysctls, it can set parameters per connection. In contrast to route metrics, it can also use port numbers and information provided by a user level program. In addition, it could set parameters probabilistically for evaluation purposes (i.e. do something different on 10% of the flows and compare results with the other 90% of the flows). Also, in cases where IPv6 addresses contain geographic information, the rules to make changes based on the distance (or RTT) between the hosts are much easier than route metric rules and can be global. Finally, unlike setsockopt, it does not require application changes and it can be updated easily at any time.

    Although the bpf cgroup framework already contains a sock related program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called only once during the connection's lifetime. In contrast, the new program type will be called multiple times from different places in the network stack code. For example, before sending SYN and SYN-ACKs to set an appropriate timeout, when the connection is established to set congestion control, etc. As a result it has "op" field to specify the type of operation requested.

    The purpose of this new program type is to simplify setting connection parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is easy to use facebook's internal IPv6 addresses to determine if both hosts of a connection are in the same datacenter. Therefore, it is easy to write a BPF program to choose a small SYN RTO value when both hosts are in the same datacenter.
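
To make the commit message concrete, here is a minimal sketch of a sock_ops program that is invoked at several points in a connection's life and dispatches on the "op" field. The function name set_syn_rto, the helper same_datacenter (which only compares the first 32 bits of the two IPv6 addresses), the constant SKETCH_AF_INET6, and the reply value 10 are illustrative assumptions, not taken from the commit. The SEC fallback exists only so the file also compiles as ordinary C for experimentation; a real BPF object would be built with clang's BPF target.

```c
#include <linux/bpf.h>

/* Real ELF section names when built with clang -target bpf; a no-op when
 * the file is compiled as ordinary userspace C for experimentation. */
#ifdef __bpf__
#define SEC(name) __attribute__((section(name), used))
#else
#define SEC(name)
#endif

/* Assumption: Linux value of AF_INET6, defined locally to avoid pulling
 * in socket headers. */
#define SKETCH_AF_INET6 10

/* Hypothetical policy: peers that share the first 32 bits of their IPv6
 * address are treated as being in the same datacenter. */
static inline int same_datacenter(const struct bpf_sock_ops *ctx)
{
    return ctx->local_ip6[0] == ctx->remote_ip6[0];
}

SEC("sockops")
int set_syn_rto(struct bpf_sock_ops *ctx)
{
    int reply = -1; /* -1 asks the kernel to keep its default value */

    /* The same program runs for many events; "op" says which one. */
    switch (ctx->op) {
    case BPF_SOCK_OPS_TIMEOUT_INIT:
        /* Called before a SYN or SYN-ACK is sent: pick a small SYN RTO
         * for nearby peers (10 is an arbitrary illustrative value). */
        if (ctx->family == SKETCH_AF_INET6 && same_datacenter(ctx))
            reply = 10;
        break;
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
        /* Called once the connection is established; a fuller program
         * could tune the socket here, for example with bpf_setsockopt(). */
        break;
    default:
        break;
    }

    ctx->reply = reply;
    return 1;
}

char LICENSE[] SEC("license") = "GPL";
```

The point of the sketch is the switch: one attached program sees the same connection at several stages, which is what the cgroup sock type was not designed for.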


    While testing, I also realized that the bpf_sock struct only allows restricted access to the source and destination IPs. Does this reinforce that CGROUP_SOCK is a subset of the SOCK_OPS program type?

    No, neither is a subset of the other; they are simply different program types intended for different use cases. They have different contexts, and each program type has its own rules about which context fields may be read and which may be written.
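
For contrast, here is a minimal sketch of a cgroup sock program, assuming it is attached at socket creation. Its context is struct bpf_sock, it has no "op" field, and it runs once for the socket rather than at each stage of a connection. The function name filter_new_socket, the SKETCH_ constants, and the mark value 42 are illustrative assumptions. As before, the SEC fallback is only there so the file also compiles as ordinary C.

```c
#include <linux/bpf.h>

/* Real ELF section names when built with clang -target bpf; a no-op when
 * the file is compiled as ordinary userspace C for experimentation. */
#ifdef __bpf__
#define SEC(name) __attribute__((section(name), used))
#else
#define SEC(name)
#endif

/* Assumptions: Linux values of AF_INET6 and SOCK_RAW, defined locally to
 * avoid pulling in socket headers. */
#define SKETCH_AF_INET6 10
#define SKETCH_SOCK_RAW 3

SEC("cgroup/sock_create")
int filter_new_socket(struct bpf_sock *sk)
{
    /* Return 0 to reject the socket: here, IPv6 raw sockets are denied
     * for every process in the cgroup. */
    if (sk->family == SKETCH_AF_INET6 && sk->type == SKETCH_SOCK_RAW)
        return 0;

    /* Only a few fields of this context are writable at socket creation;
     * mark is one of them (42 is an arbitrary illustrative value). */
    sk->mark = 42;

    return 1; /* allow the socket to be created */
}

char LICENSE[] SEC("license") = "GPL";
```

Here the return value is a verdict on the socket itself, whereas a sock_ops program answers a specific question from the TCP stack through its reply field.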