
Why am I receiving packets bigger than the MTU with a raw packet socket?


I am trying to transfer packets from one interface to another using raw packet sockets (just for fun). I started with receiving packets.

On my machine (Arch Linux, IP 192.168.30.3) I wrote this code:

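#include <stdio.h>
#include <sys/socket.h>         /* socket, recvfrom */
#include <linux/if_packet.h>    /* struct sockaddr_ll */
#include <net/ethernet.h>       /* the L2 protocols */
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>

int main(void)
{
    /* needs root (CAP_NET_RAW); receives whole Ethernet frames carrying IPv4 */
    int packet_socket = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
    if (packet_socket < 0) {
        perror("socket");
        return 1;
    }

    /* test reception; longer frames are truncated to the buffer size,
       but tot_len is still read from the (intact) IP header */
    unsigned char packet[4096];
    struct sockaddr_ll rcvaddr;
    in_addr_t my_ip = inet_addr("192.168.30.3");    // my ip

    // use nc to send TCP traffic
    while (1) {
        socklen_t len = sizeof(rcvaddr);
        ssize_t len_packet = recvfrom(packet_socket, packet, sizeof(packet), 0,
                                      (struct sockaddr *) &rcvaddr, &len);
        if (len_packet < 0) {
            perror("recvfrom");
            return 1;
        }
        // ignore frames too short to hold the Ethernet and IPv4 headers
        if ((size_t) len_packet < sizeof(struct ethhdr) + sizeof(struct iphdr))
            continue;

        // check if the packet is for us
        struct iphdr *iph = (struct iphdr *) (packet + sizeof(struct ethhdr));
        if (iph->daddr != my_ip)
            continue;

        // check if tcp
        if (iph->protocol != IPPROTO_TCP)
            continue;

        // length announced by the IP header, plus the Ethernet header
        printf("Total packet length: %zu\n",
               sizeof(struct ethhdr) + ntohs(iph->tot_len));
    }
}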

Then I run it as root and, alongside it, start a listener with nc -lp 12345 -n > /dev/null.

On another machine (Debian, 192.168.30.4) I run dd if=/dev/urandom | nc 192.168.30.3 12345, which makes my program print the length of the received packets.

In the output I see packets larger than the MTU (1500 on both machines). For instance, my program prints "Total packet length: 16962". (The same behaviour is described in the question "linux raw ethernet socket receive more bytes than MTU".)
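The comparison above can be made explicit with a small pure helper. Note that the MTU bounds the IP packet, not the Ethernet frame, so the fair comparison is the IP total length against 1500, without the 14-byte Ethernet header that the program adds. This is a minimal sketch; the helper names `ip_total_len` and `ip_exceeds_mtu` are hypothetical, and it assumes glibc's `struct iphdr` and the kernel's `struct ethhdr`.

```c
#include <arpa/inet.h>       /* ntohs */
#include <linux/if_ether.h>  /* struct ethhdr, ETH_P_IP */
#include <netinet/ip.h>      /* struct iphdr */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>          /* memcpy */

/* Hypothetical helper: IPv4 total length (header + payload) stored in a
 * captured Ethernet frame, or 0 if the frame is too short or not IPv4.
 * Headers are copied out with memcpy to avoid unaligned access. */
static unsigned ip_total_len(const unsigned char *frame, size_t caplen)
{
    struct ethhdr eth;
    struct iphdr ip;

    if (caplen < sizeof(eth) + sizeof(ip))
        return 0;
    memcpy(&eth, frame, sizeof(eth));
    if (ntohs(eth.h_proto) != ETH_P_IP)
        return 0;
    memcpy(&ip, frame + sizeof(eth), sizeof(ip));
    if (ip.version != 4)
        return 0;
    return ntohs(ip.tot_len);
}

/* The MTU limits the IP packet, so the Ethernet header is not counted. */
static bool ip_exceeds_mtu(const unsigned char *frame, size_t caplen,
                           unsigned mtu)
{
    return ip_total_len(frame, caplen) > mtu;
}
```

With the numbers from the question, 16962 - 14 = 16948 bytes of IP, which is more than eleven times the 1500-byte MTU, so this is not an off-by-a-header mistake.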

I know about IP fragmentation, so my first thought was IP reassembly. However, man 7 raw says: "Note that packet sockets don't reassemble IP fragments, unlike raw sockets." Since I am using a packet socket (AF_PACKET), there should be no reassembly and the packets should stay within the MTU, right?

I also ran sudo ethtool -K ens3 tx off sg off tso off and tried each of the values 0, 1, 2 and 3 in /proc/sys/net/ipv4/ip_no_pmtu_disc on both machines.

Is 192.168.30.4 sending packets larger than the MTU, or is my machine performing some reassembly despite what the manual says?

ethtool -k ens3 gives:

On 192.168.30.4:

seb@SERVER:~$ sudo ethtool -k ens3 
Features for ens3:
rx-checksumming: off
tx-checksumming: off
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: off
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: off
        tx-scatter-gather: off
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off
generic-segmentation-offload: off [requested on]
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

On 192.168.30.3:

[seb@archlinux ~]$ sudo ethtool -k ens3
Features for ens3:
rx-checksumming: off
tx-checksumming: off
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: off
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: off
        tx-scatter-gather: off
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: off [fixed]
generic-segmentation-offload: off [requested on]
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

Also note that both machines are QEMU VMs run by GNS3 with the following network options: -net none -device e1000,mac=0c:7e:08:49:13:00,netdev=gns3-0 -netdev socket,id=gns3-0,udp=127.0.0.1:20049,localaddr=127.0.0.1:20048


Solution

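  • Since the observed total packet length is far larger than even a typical jumbo frame (MTU of about 9000 bytes), the receiver is most likely using either Large Receive Offload (LRO) or Generic Receive Offload (GRO). Both merge consecutive TCP segments of the same flow into one larger packet before the rest of the network stack handles it (LRO in the NIC or its driver, GRO in the kernel's receive path right after the driver), and the merged packet's IP header carries the combined total length. The packet socket then receives this already merged (large) packet. Note that this is TCP segment coalescing, not IP fragment reassembly, so it does not contradict the man page quoted in the question.

    In this specific case, the ethtool -k output shows that LRO is off and cannot be changed (large-receive-offload: off [fixed]), whereas GRO is on (generic-receive-offload: on) and adjustable. As discussed in the comments, disabling GRO on the receiver (ethtool -K ens3 gro off) does solve the problem.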