
Bitfield and Union - unexpected result in C


I've been given the following assignment in a C course:

(image of the assignment: decode a message, a date/time plus a code, packed into a 64-bit integer)

I've implemented the assignment to decode the 8-byte long long int 131809282883593 as follows:

    #include <stdio.h>
    #include <string.h>

    struct Message {
         unsigned int hour : 5;
         unsigned int minutes : 6;
         unsigned int seconds : 6;
         unsigned int day : 5;
         unsigned int month : 4;
         unsigned int year : 12;
         unsigned long long int code : 26;
    };  // 64 bit in total

    union Msgdecode {
        long long int datablob;
        struct Message elems;
    };

    int main(void) {

        long long int datablob = 131809282883593;
        union Msgdecode m;

        m.datablob = datablob;

        printf("%d:%d:%d %d.%d.%d code:%lu\n", m.elems.hour, m.elems.minutes,
        m.elems.seconds, m.elems.day, m.elems.month, m.elems.year,(long unsigned int) m.elems.code); 

        union Msgdecode m2;
        m2.elems.hour = 9;
        m2.elems.minutes = 0;
        m2.elems.seconds = 0;
        m2.elems.day = 30;
        m2.elems.month = 5;
        m2.elems.year = 2017;
        m2.elems.code = 4195376;

        printf("m2.datablob: should: 131809282883593 is: %lld\n", m2.datablob); //WHY does m2.datablob != m.datablob?! 
        printf("m.datablob:  should: 131809282883593 is: %lld\n", m.datablob);

        printf("%d:%d:%d %d.%d.%d code:%lu\n", m2.elems.hour, m2.elems.minutes,
          m2.elems.seconds, m2.elems.day, m2.elems.month, m2.elems.year, (long unsigned int) m2.elems.code);

    }


What gives me a hard time is the output. The decoding/encoding works nicely so far: 9:0:0 30.5.2017 and code 4195376 are expected, but the difference in the 'datablob' values really isn't, and I can't figure out why/where it stems from:

9:0:0 30.5.2017 code:4195376
m2.datablob: should: 131809282883593 is: 131810088189961
m.datablob:  should: 131809282883593 is: 131809282883593
9:0:0 30.5.2017 code:4195376

As you can see, the datablob is close to the original, but not the original. I've consulted a coworker who's fluent in C about this, but we couldn't figure out the reason for this behaviour.

Q: Why do the blobs differ from each other?

Bonus-Q: When extending the union Msgdecode with another field, a strange thing happens:

union Msgdecode {
    long long int datablob;
    struct Message elems;
    char bytes[8];  // added this
};

Outcome:

9:0:0 30.5.2017 code:0
m2.datablob: should: 131809282883593 is: 8662973939721
m.datablob:  should: 131809282883593 is: 131809282883593
9:0:0 30.5.2017 code:4195376

PS: Reading SO questions about bit-fields combined with unions gave me the impression that they are rather unreliable. Can this be said in general?


Solution

  • The layout of bit-fields within a struct and any padding that may exist between them are implementation-defined.

    From section 6.7.2.1 of the C standard:

    11 An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.

    This means that you can't rely on the layout in a standard-compliant manner.
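
    If code elsewhere assumes that struct Message maps exactly onto a 64-bit blob, you can at least make that assumption fail loudly at compile time. A minimal sketch, assuming C11:

    _Static_assert(sizeof(struct Message) == sizeof(long long),
                   "struct Message does not overlay a long long exactly");

    On the implementation examined below this assertion fires (the struct turns out to be 16 bytes), which is exactly the early warning you'd want.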

    That being said, let's take a look at how the bits are being laid out in this particular case. To reiterate, everything from here down is in the realm of implementation-defined behavior. We'll start with the second case, where m2.datablob is 8662973939721, as that is easier to explain.

    First let's look at the bit representation of the values you assign to m2:

     - hour:       9:   0 1001 (0x09)
     - minutes:    0:   00 0000 (0x00)
     - seconds:    0:   00 0000 (0x00)
     - day:       30:   1 1110 (0x1E)
     - month:      5:   0101 (0x05)
     - year:    2017:   0111 1110 0001 (0x7e1)
     - code: 4195376:   00 0100 0000 0000 0100 0011 0000 (0x0400430)
    
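    These patterns are easy to reproduce with a small helper; a minimal sketch, assuming C99 for the loop declaration:

    /* print the low 'bits' bits of v, most significant bit first */
    static void print_bits(unsigned long long v, int bits) {
        for (int i = bits - 1; i >= 0; i--)
            putchar(((v >> i) & 1) ? '1' : '0');
        putchar('\n');
    }

    For example, print_bits(30, 5) prints 11110 and print_bits(4195376, 26) prints 00010000000000010000110000.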

    Now let's look at the blob values: first m, which assigns the value to datablob directly, then m2, which assigns each field individually with the above values:

    131809282883593  0x77E13D7C0009                        0111 0111 1110 0001 
                                       0011 1101 0111 1100 0000 0000 0000 1001
    
      8662973939721  0x07E1017C0009                        0000 0111 1110 0001 
                                       0000 0001 0111 1100 0000 0000 0000 1001
    

    If we start by looking at the values from the right going left, we can see the value 9, so there are our first 5 bits. Next come two sets of 6 zero bits for the next two fields. After that, we see the bit patterns for 30, then 5.
    A little further up we see the bit pattern for the value 2017, but there are 6 bits set to zero between this value and the prior ones. So it looks like the layout is as follows:

    bits 47-44   (padding)   0000
    bits 43-32   year        0111 1110 0001   (2017)
    bits 31-26   (padding)   00 0000
    bits 25-22   month       0101      (5)
    bits 21-17   day         1 1110    (30)
    bits 16-11   seconds     00 0000   (0)
    bits 10-5    minutes     00 0000   (0)
    bits  4-0    hour        0 1001    (9)
    

    So there's some padding between the year and month fields. Comparing the m and m2 representations, the differences are in the 6 bits of padding between month and year as well as 4 bits to the left of year.
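
    We can double-check the deduced layout by decoding the original blob with plain shifts and masks at exactly those offsets. A sketch, valid only for this implementation-specific layout (note that year is read from bit 32 because of the 6 padding bits):

    unsigned long long blob = 0x77E13D7C0009ULL;  /* 131809282883593 */
    printf("%llu:%llu:%llu %llu.%llu.%llu\n",
            blob        & 0x1F,    /* hour:    bits  0-4  */
           (blob >> 5)  & 0x3F,    /* minutes: bits  5-10 */
           (blob >> 11) & 0x3F,    /* seconds: bits 11-16 */
           (blob >> 17) & 0x1F,    /* day:     bits 17-21 */
           (blob >> 22) & 0x0F,    /* month:   bits 22-25 */
           (blob >> 32) & 0xFFF);  /* year:    bits 32-43 */
    /* prints 9:0:0 30.5.2017 */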

    What we don't see here are the bits for the code field. So just how big is the struct?

    If we add this to the code:

    printf("size = %zu\n", sizeof(struct Message));
    

    We get:

    size = 16
    

    It's considerably bigger than we thought. So let's make the bytes array unsigned char [16] (unsigned char so that %02x prints each byte cleanly, without sign extension) and output it. The code:

    int i;
    printf("m: ");
    for (i=0; i<16; i++) {
        printf(" %02x", m.bytes[i]);
    }
    printf("\n");
    printf("m2:");
    for (i=0; i<16; i++) {
        printf(" %02x", m2.bytes[i]);
    }
    printf("\n");
    

    Output:

    m:  09 00 7c 3d e1 77 00 00 00 00 00 00 00 00 00 00
    m2: 09 00 7c 01 e1 07 00 00 30 04 40 00 00 00 00 00
    

    Now we see the 0x0400430 bit pattern corresponding to the code field in the representation for m2. There are an additional 20 bits of padding before this field. Also note that the bytes are in the reverse order of the value, which tells us we're on a little-endian machine. Given the way the values are laid out, it's also likely that the bit-fields are allocated starting from the low-order bits of each storage unit.
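
    If you want to confirm the byte order on your own machine, a quick sketch:

    unsigned int one = 1;
    puts(*(unsigned char *)&one ? "little-endian" : "big-endian");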

    So why the padding? It's most likely related to alignment. The first 5 fields are 8 bits or fewer, meaning they each fit into a byte, and since there is no alignment requirement for single bytes, they are packed. The next field is 12 bits, meaning it needs to fit into a 16-bit (2-byte) field, so 6 bits of padding are added and the field starts on a 2-byte offset. The next field is 26 bits, which needs a 32-bit field; that would mean starting on a 4-byte offset and using 4 bytes. However, since this field is declared unsigned long long, which in this case is 8 bytes, the field uses up 8 bytes. Had you declared this field unsigned int, it would probably still start at the same offset but use up only 4 more bytes instead of 8.
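
    That last claim is easy to test: declare a sibling struct that differs only in the declared type of code and compare the sizes. A sketch; the 12 in the comment is the likely gcc/x86-64 result per the reasoning above, not a guarantee:

    struct MessageInt {            /* same fields, but code is unsigned int */
        unsigned int hour : 5;
        unsigned int minutes : 6;
        unsigned int seconds : 6;
        unsigned int day : 5;
        unsigned int month : 4;
        unsigned int year : 12;
        unsigned int code : 26;    /* was: unsigned long long int */
    };

    printf("%zu %zu\n", sizeof(struct Message), sizeof(struct MessageInt));
    /* expect something like: 16 12 */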

    Now what about the first case where the blob value is 131810088189961? Let's look at its representation compared to the "expected" one:

    131809282883593  0x77E13D7C0009                        0111 0111 1110 0001 
                                       0011 1101 0111 1100 0000 0000 0000 1001
    
    131810088189961  0x77E16D7C0009                        0111 0111 1110 0001 
                                       0110 1101 0111 1100 0000 0000 0000 1001
    

    These two representations have the same values in the bits that store the data. The difference between them is in the 6 padding bits between the month and year fields. As to why this representation is different, the compiler probably made some optimizations when it realized certain bits weren't or couldn't be read or written. Adding a char array to the union made it possible for those bits to be read or written, so that optimization could no longer be made.
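
    A practical consequence: if you want the padding bits of m2 to start in a known state on a given implementation, zero the whole union before assigning the fields. This makes m2.datablob reproducible (the 8662973939721 value seen above, with zeroed padding), though still not equal to m.datablob, because m's blob carries nonzero bits in the padding positions. A sketch (string.h is already included):

    union Msgdecode m2;
    memset(&m2, 0, sizeof m2);   /* put every bit, padding included, in a known state */
    m2.elems.hour = 9;
    /* ... assign the remaining fields as before ... */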

    With gcc, you could try using __attribute__((packed)) on the struct; a sketch of the packed declaration (same fields, only the attribute added):
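
    struct Message {
        unsigned int hour : 5;
        unsigned int minutes : 6;
        unsigned int seconds : 6;
        unsigned int day : 5;
        unsigned int month : 4;
        unsigned int year : 12;
        unsigned long long int code : 26;
    } __attribute__((packed));   /* gcc extension: no padding between fields */

    Doing that gives the following output (after adjusting the bytes array to 8 along with the loop limits when printing):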

    size = 8
    9:0:0 30.5.2127 code:479
    m2.datablob: should: 131809282883593 is: 1153216309106573321
    m.datablob:  should: 131809282883593 is: 131809282883593
    9:0:0 30.5.2017 code:4195376
    m:  09 00 7c 3d e1 77 00 00
    m2: 09 00 7c 85 1f 0c 01 10
    

    And the bit representation (the packed m2 blob compared with the unpacked m2 blob from before):

    1153216309106573321 0x10010C1F857C0009   0001 0000 0000 0001 0000 1100 0001 1111
                                             1000 0101 0111 1100 0000 0000 0000 1001
    
        131810088189961 0x77E16D7C0009       0000 0000 0000 0000 0111 0111 1110 0001 
                                             0110 1101 0111 1100 0000 0000 0000 1001
    

    But even with this, you could run into issues: the allocation order of the bit-fields is still implementation-defined, and the attribute itself is a gcc extension.

    So to summarize, with bit-fields there's no guarantee of the layout. You're better off using bit shifting and masking to get the values in and out of the message, rather than trying to overlay the struct onto the raw value, as sketched below.
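
    As a concrete sketch of that advice, the helpers below pick one explicit layout (the seven fields packed contiguously from the least-significant bit, in declaration order) and move the values in and out with shifts and masks. The offsets are a deliberate choice made in code, not the assignment's layout, but since encode and decode share the same constants they agree on every conforming implementation:

    #include <stdio.h>

    /* self-chosen layout: field offsets from the least-significant bit */
    enum {
        HOUR_SHIFT  =  0,  /*  5 bits */
        MIN_SHIFT   =  5,  /*  6 bits */
        SEC_SHIFT   = 11,  /*  6 bits */
        DAY_SHIFT   = 17,  /*  5 bits */
        MONTH_SHIFT = 22,  /*  4 bits */
        YEAR_SHIFT  = 26,  /* 12 bits */
        CODE_SHIFT  = 38   /* 26 bits */
    };

    static unsigned long long encode(unsigned hour, unsigned min, unsigned sec,
                                     unsigned day, unsigned month, unsigned year,
                                     unsigned long code) {
        return (unsigned long long)(hour  & 0x1FU)       << HOUR_SHIFT
             | (unsigned long long)(min   & 0x3FU)       << MIN_SHIFT
             | (unsigned long long)(sec   & 0x3FU)       << SEC_SHIFT
             | (unsigned long long)(day   & 0x1FU)       << DAY_SHIFT
             | (unsigned long long)(month & 0x0FU)       << MONTH_SHIFT
             | (unsigned long long)(year  & 0xFFFU)      << YEAR_SHIFT
             | (unsigned long long)(code  & 0x3FFFFFFUL) << CODE_SHIFT;
    }

    int main(void) {
        unsigned long long blob = encode(9, 0, 0, 30, 5, 2017, 4195376);
        printf("%llu:%llu:%llu %llu.%llu.%llu code:%llu\n",
               (blob >> HOUR_SHIFT)  & 0x1F,
               (blob >> MIN_SHIFT)   & 0x3F,
               (blob >> SEC_SHIFT)   & 0x3F,
               (blob >> DAY_SHIFT)   & 0x1F,
               (blob >> MONTH_SHIFT) & 0x0F,
               (blob >> YEAR_SHIFT)  & 0xFFF,
               (blob >> CODE_SHIFT)  & 0x3FFFFFF);
        return 0;
    }

    This prints 9:0:0 30.5.2017 code:4195376 on any conforming implementation. As it happens, this layout matches gcc's packed allocation on this platform, which is why encode() here produces the same 1153216309106573321 value seen in the packed run above.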