
Is accessing arrays out-of-bounds legal if what lies beyond those bounds is known in C? If not why not and how can this be worked around?


Take the following which works in GCC:

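#include <assert.h>
#include <stdio.h>

struct Int2 {
    int i[2];
};
struct Int4 {
    struct Int2 i2[2];
};
static_assert(sizeof(struct Int4)==sizeof(int[4]), "sizeof(struct Int4)!=sizeof(int[4])");

int main(void) {
    struct Int4 i4;
    i4.i2[1].i[-1] = 10; // index -1 lies outside the array i4.i2[1].i
    printf("%d\n", i4.i2[0].i[1]); // 10 (as observed with GCC)
}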

However, according to a comment, this is technically undefined behavior and not legal:

The rule that specifies pointer arithmetic, C 2018 6.5.6 8, defines it only for arithmetic within an array (including the end position one beyond the last element and treating a single object as an array of one element). This creates a "pointer provenance" property; if p[x] has behavior defined by the C standard, it can refer only to elements of an array p points to. Compilers may use this to reduce pointer arithmetic when optimizing, and this reduction may break code that attempts to use indexing outside the actual array.

If this comment is correct, this answers the first question: No. Which brings us to my actual questions:

  • If arrays use pointer arithmetic, then doing so would just generate a pointer to the out-of-bounds but known value. Could someone explain what could be going on behind the scenes that could prevent this from working, despite this working in practice on GCC?
  • Is there any way to 'circumvent' this undefined behavior? Some undefined behavior can be circumvented trivially; signed overflow, for example, can be avoided by casting to an unsigned type before performing the arithmetic. Similarly, assuming I only have access to a pointer to the second element of an i2 member, is there any way to access the first i2 member and/or the constituent int values without invoking undefined behavior?

Solution

  • Is accessing arrays out-of-bounds legal if what lies beyond those bounds is known in C?

    I take you to mean evaluating an expression of the form array[i], where array is an expression having array value (prior to decay), and i is either negative or greater than or equal to the number of elements in the array. No, that is not "legal", by which I mean that the C language specification does not define the behavior. It does not matter what lies outside the bounds of the array, or whether that is known in some sense.

    Why not

    Because the language spec says so. Specifically, it says that array[i] is equivalent to *((array) + (i)), where array is, as usual, subject to decay to a pointer. And it says explicitly that the pointer addition (array) + (i) is defined only if i is between 0 and the number of elements in the array, inclusive, and that dereferencing the result has undefined behavior if i is equal to the number of elements in the array.

    But perhaps you are asking about rationale. The committee has not published an official rationale for those semantics, but it seems reasonable that they favored simpler rules with fewer exceptions. Additionally, they definitely avoid assuming an addressing model that supports "what lies beyond?" even being a sensible question.

    and how can this be worked around?

    Generally, don't rely on out-of-bounds array accesses.

    If the knowledge you claim about what lies beyond an array is based on the array being part of another object, however, then you may be able to use information about the container type. In your particular example, I would just change i4.i2[1].i[-1] to i4.i2[0].i[1]. Under other circumstances, you might be able to use pointer conversions to express the access you want relative to the container. In the worst case, you might need to perform a deeper refactoring to avoid out of bounds accesses.

    If arrays use pointer arithmetic

    They do.

    then doing so would just generate a pointer to the out-of-bounds but known value.

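    On typical implementations, yes: the generated code just computes an address. But the spec defines that pointer arithmetic only within the array (plus the position one past the end), and it does not allow you to dereference an out-of-bounds result. Perhaps that's inconsistent with what the hardware does, but it's what the spec says.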

    Could someone explain what could be going on behind the scenes that could prevent this from working, despite this working in practice on GCC?

    A compiler is permitted to assume that your code has well-defined behavior. It is permitted to do anything at all in the event of undefined behavior, whether it recognizes the UB or not. Compilers can and do leverage that to implement optimizations that are correct as long as all program behavior is defined, but that produce results different from what you might naively expect when the behavior is undefined. That you do not observe that in your particular example is not generalizable to other code or other compilers.

    Is there any way to 'circumvent' this undefined behavior?

    There are lots of ways. Some better, and some worse. Specifics depend on the situation.

    assuming I only have access to a pointer to the second element of an i2 member, is there any way to access the first i2 member and/or the constituent int values without invoking undefined behavior?

    The only way you could have access to the second element of one of your i2s without also having access to the first is if the access were through a pointer:

    void process_second_i2(struct Int2 *x2) {
        // ...
    }
    

    But given that this is a pointer to the second element of an array, it is valid to use it to access the first element of the same array:

    int x1_0 = (x2 - 1)->i[0];

    // or

    int x1_1 = x2[-1].i[1];
    

    But that has UB if x2 does not after all point to the second or subsequent element of an array.


    Side note: You previously claimed:

    In this case, the union is not allowed to have any padding,

    but the C language specification does not support that claim. The ABI of the machine you are compiling for will specify, and it might or might not support that claim. For example, it might specify that int is 4 bytes wide and all structure and union sizes are multiples of 16 bytes. In that case, your union Int2 would indeed contain padding, but your union Int4 would not contain any padding of its own.

    Under other circumstances, union Int4 might contain padding.