Tags: c, gcc, strict-aliasing

Strict-Aliasing warnings and tcpdump example code


Simplifying the concept, the strict-aliasing rule states that an object should be accessed by a pointer of a compatible type or a pointer to char. That way, the compiler can make some assumptions about the code and make some optimizations.

Even though its interpretation can raise some doubts and discussion, the rule itself is no state secret. So my questions are:

Why do some respected projects, maintained by experienced programmers, frequently publish code that doesn't respect the strict aliasing rule? One example is tcpdump: the libpcap tutorial on their website includes example code that apparently breaks the strict aliasing rule several times. I've seen plenty of other code that does the same, especially when handling network packets. Did they just get lucky that the compiler didn't break their code, so the problem went unnoticed? Do they rely on the user compiling with the -fno-strict-aliasing flag? That seems possible, considering that some respected programmers don't think the optimizations gained from strict aliasing are worth the kind of bad assumptions the compiler could make (I believe Linus Torvalds is one of them, judging from a mailing-list thread I've seen about a Linux snippet that would break with strict aliasing enabled). Or is it just bad code and a bad practice that is unfortunately ingrained in the programming community?

The other question is about the sniffex.c code from tcpdump: why, even when it is compiled with gcc -O5 -Wall -Wextra -Wstrict-aliasing=1 sniffex.c -lpcap, does gcc 5.4.0 not issue any warnings about the strict aliasing rule being broken? Is it just because gcc doesn't easily detect this kind of type punning when the address-of operator & isn't involved?

I feel bad about bringing up this topic once again (there are so many other questions about it), but even though I understand the rule, I can't understand why it is ignored so often in so many places.


EDIT:

The snippets of the tcpdump example code that apparently break the strict aliasing rule are:

void got_packet(u_char *args, const struct pcap_pkthdr *header, const u_char *packet)
{
...
/* declare pointers to packet headers */
const struct sniff_ethernet *ethernet;  /* The ethernet header [1] */
const struct sniff_ip *ip;              /* The IP header */
const struct sniff_tcp *tcp;            /* The TCP header */
const char *payload;                    /* Packet payload */
...
/* define ethernet header */
ethernet = (struct sniff_ethernet*)(packet);

/* define/compute ip header offset */
ip = (struct sniff_ip*)(packet + SIZE_ETHERNET);
...
/* define/compute tcp header offset */
tcp = (struct sniff_tcp*)(packet + SIZE_ETHERNET + size_ip);
...
/* define/compute tcp payload (segment) offset */
payload = (u_char *)(packet + SIZE_ETHERNET + size_ip + size_tcp);
...

There, they overlay structures representing the different parts of a network packet onto the buffer to get easier access to each field. The bottom line is that the code accesses the packet through several pointers whose type is not u_char (the packet's original type), thus, I believe, violating the strict aliasing rule.


Solution

  • The strict aliasing rule is controversial.

    Bit of background:

    Note that "the strict aliasing rule" is not a formal term; it refers to paragraph 6.5/6 regarding effective type and paragraph 6.5/7 regarding accessing data through a pointer. The latter paragraph is the actual strict aliasing rule, and it has been part of C for as long as the language has been standardized, so its existence should not come as a shock to anyone. The text in 6.5/7 is nearly identical all the way from the ANSI C drafts to C11.

    However, this section was unclear in C90, because it focused on the type of the pointer used for the "lvalue access" rather than the type of the data actually stored there. That left situations involving casts to void pointers unclear, such as using memcpy or doing various forms of type punning.

    In C99 there was some attempt to clarify this by introducing effective type. This didn't actually change the wording of the strict aliasing rule much, just made the interpretation somewhat clearer. (It still remains one of the hardest parts in the standard to understand.)

    The original intent for the rule was to allow compilers to avoid weird worst-case assumptions, such as this example from the C99 rationale:

    int a;
    void f( double * b )
    {
      a = 1;
      *b = 2.0;
      g(a);
    }
    

    If the compiler can assume that b is not pointing at a, which should be a sensible assumption to make given the wildly different types, then it can optimize the function to

    a = 1;
    *b = 2.0;
    g(1); // micro-optimization, doesn't have to load `a` from memory
    

    So even though the rule has been there all along, it wasn't a problem until somewhere around C99, when the gcc compiler in particular decided to go haywire and abuse the cases where different effective types were used. For example, this code makes perfect sense, yet violates strict aliasing:

    uint32_t u32=0;
    uint16_t* p16 = (uint16_t*)&u32; // grab the ms/ls word (endian-dependent)
    *p16 = something;
    if(u32)
      do_stuff();
    

    The above would be very useful code in all manner of bit-twiddling and hardware-related programming. Most compilers will generate what the programmer expects, namely code that changes the ms/ls word of the 32-bit value and then checks whether the function should be called.

    However, since the above code is formally undefined behavior because of the strict aliasing violation, compilers like gcc might decide to abuse it and generate code that always removes the call to do_stuff() from the machine code, since it may assume that nothing in the code changes u32 from having the value 0.

    To dodge that unwanted compiler behavior, the programmer has to go out of their way: either make u32 volatile so that the compiler is forced to read it (which blocks all optimizations on the variable, not just the undesired one), or come up with a home-brewed union type containing one uint32_t and two uint16_t, or possibly access u32 byte by byte. Very inconvenient.
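    As a rough sketch of the union and byte-wise workarounds, the two helpers below overwrite the first two bytes (in memory order) of a uint32_t without dereferencing a uint16_t pointer into it. The names set_first_half_memcpy, set_first_half_union and union u32_halves are made up for this illustration. The memcpy variant is one form of byte-wise access; the union variant relies on reading a union member other than the one last written, which C99 (with its later corrigenda) and C11 describe as a reinterpretation of the bytes, but which is not valid C++.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: overwrite the first two bytes (in memory order)
   of *u32 with `value`. memcpy accesses the object as raw bytes, which
   the aliasing rule permits. */
static void set_first_half_memcpy(uint32_t *u32, uint16_t value)
{
    assert(u32 != NULL);
    memcpy(u32, &value, sizeof value);
}

/* Hypothetical "home-brewed" union: one uint32_t overlaid with two
   uint16_t halves. */
union u32_halves {
    uint32_t whole;
    uint16_t half[2];
};

/* Same effect as above, but going through the union instead of memcpy. */
static void set_first_half_union(uint32_t *u32, uint16_t value)
{
    union u32_halves tmp;

    assert(u32 != NULL);
    tmp.whole = *u32;      /* load the full 32-bit value            */
    tmp.half[0] = value;   /* replace the half that comes first     */
    *u32 = tmp.whole;      /* store the combined value back         */
}
```

    Which half of the value (most or least significant) ends up modified still depends on the endianness of the machine, exactly as in the pointer-cast version.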


    Therefore programmers tend to rebel against the strict aliasing rule and write code that relies on the compiler not making incredibly weird optimizations based on strict aliasing. There exist many valid cases where you want to break up a chunk of data into different parts, such as when de-serializing a block of raw data bytes.
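    For comparison, here is a minimal sketch of splitting a raw byte block into parts without a pointer cast: the header is copied out of the buffer into a local struct. struct demo_header and read_header are hypothetical names invented for this example (they are not libpcap or tcpdump types), and the sketch assumes the fields are already laid out in the buffer the way the struct is laid out in memory, in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout, for illustration only. */
struct demo_header {
    uint8_t  type;
    uint8_t  flags;
    uint16_t length;   /* assumed to be in host byte order in this sketch */
};

/* Copy a header out of the raw buffer at `offset` instead of casting
   `packet + offset` to a struct pointer.
   Returns 0 on success, -1 if the buffer is too short. */
static int read_header(const uint8_t *packet, size_t packet_len,
                       size_t offset, struct demo_header *out)
{
    assert(packet != NULL && out != NULL);

    if (offset > packet_len || packet_len - offset < sizeof *out)
        return -1;

    /* Byte-wise copy: no aliasing violation, and no misaligned access
       even when `offset` is odd. */
    memcpy(out, packet + offset, sizeof *out);
    return 0;
}
```

    The cost is one small copy per header, which compilers can often reduce to plain loads; the bounds check is an addition of this sketch, not something the cast-based code provides.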

    For example, if I receive serial data byte by byte and store it in an array of uint8_t that I, the programmer, know contains a uint16_t, I should be able to write code like (uint16_t*)array without the compiler making assumptions such as "oh look, this array is never used, let's optimize it away" or some other nonsense.
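    Until that is guaranteed, one conforming way to pull a uint16_t out of such a uint8_t array looks roughly like the sketch below. The function names read_u16_native and read_u16_be are hypothetical; the first copies the bytes in host order, the second assembles a big-endian (network order) value from individual bytes, which is always allowed because the array really does hold uint8_t objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 16-bit value stored in host byte order at the start of buf. */
static uint16_t read_u16_native(const uint8_t *buf, size_t len)
{
    uint16_t v;

    assert(len >= sizeof v);
    memcpy(&v, buf, sizeof v);   /* byte copy instead of *(uint16_t *)buf */
    return v;
}

/* Read a 16-bit big-endian value at the start of buf, independent of
   the host's byte order. */
static uint16_t read_u16_be(const uint8_t *buf, size_t len)
{
    assert(len >= 2);
    return (uint16_t)(((unsigned)buf[0] << 8) | buf[1]);
}
```

    For instance, with the bytes { 0x12, 0x34 } in the array, read_u16_be yields 0x1234 on any host, while read_u16_native yields 0x1234 or 0x3412 depending on endianness.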

    Most compilers will not go crazy but generate the expected code. But they are allowed to go crazy by the standard. And with the growing popularity of gcc in hardware-related programming, this is becoming a serious problem for the embedded industry, where hardware-related programming is an everyday task, rather than an exotic special case.

    Overall, the standard committee has repeatedly failed to see this problem.


    And then of course, a lot of programmers actually don't know about the strict aliasing rule in the first place, which is most often the explanation of why they write code violating it.