Tags: c++, pointers, null, gdb, remote-debugging

C++ - Method returns non-null pointer according to gdb but the variable it's assigned to is null


I have a problem where I call a method from a statically linked library and the method returns a pointer to a data structure. According to the debugger, the returned value is non-null. But after the method returns and the value is assigned to a local variable, the variable is null.

The screen recording below demonstrates the problem. The recording starts before the method is called, then steps into the method and back out. As you can see, the method returns a pointer to the address 0x6920ae10 but then the value stored in the local pointer variable is 0x0.

[Screen recording: stepping into the method and back out in the debugger]

I'm at a loss here. I have been using C++ for many years, but I have never encountered a problem like this before. Am I missing something stupid here? What could cause this problem?

I compiled the statically linked library (LLRP for Impinj RFID Readers) just before, directly on the machine where the code is executed, and I also just recompiled the whole program on the same machine, so I don't think it's a mismatch between the binary code on the remote machine and the code in the IDE.

The same code did work before, but now it's running on a different platform (on a Raspberry Pi instead of an Alix-board and on Raspbian instead of Ubuntu).

Update: I have been investigating this problem further today and I found that the problem occurs here (slightly changed from the code in the recording, but the problem is the same):

::LLRP::CReaderEventNotificationData *p_msg_ren_d = ((::LLRP::CREADER_EVENT_NOTIFICATION *) p_msg)->getReaderEventNotificationData();

if (p_msg_ren_d == NULL)
{
    delete p_connection;
    delete p_msg;
    this->_fail("Invalid response from reader (1).");
    return;
}

This is the disassembly at the point where the method gets called (I'm compiling with -O0). The comments are mine and describe what I think is going on:

=> 0x001ee394 <+576>:   ldr r0, [r11, #-24] ; 0xffffffe8                    "Load address of p_msg into r0"
   0x001ee398 <+580>:   bl  0x1f0658 <LLRP::CREADER_EVENT_NOTIFICATION::getReaderEventNotificationData()> "call getReaderEventNotificationData"
   0x001ee39c <+584>:   str r0, [r11, #-28] ; 0xffffffe4                    "store r0 on the stack at sp-28"
   0x001ee3a0 <+588>:   ldr r3, [r11, #-28] ; 0xffffffe4                    "load sp-28 into r3"
   0x001ee3a4 <+592>:   cmp r3, #0                                          "check if rd is NULL"

Here is the C++ code of the method that gets called (p_msg->getReaderEventNotificationData()), followed by its disassembly:

inline CReaderEventNotificationData *
getReaderEventNotificationData (void)
{
    return m_pReaderEventNotificationData;
}
   0x001f0658 <+0>:     push    {r11}       ; (str r11, [sp, #-4]!) "save r11"
   0x001f065c <+4>:     add r11, sp, #0                             "save sp in r11"
   0x001f0660 <+8>:     sub sp, sp, #12                             "decrement sp by 12"
   0x001f0664 <+12>:    str r0, [r11, #-8]                          "store r0 on the stack at sp-8"
=> 0x001f0668 <+16>:    ldr r3, [r11, #-8]                          "load sp-8 into r3"
   0x001f066c <+20>:    ldr r3, [r3, #24]                           "load r3+24 into r3 THIS IS WRONG!"
                                                                    "m_pReaderEventNotificationData is at offset 28 not 24"
   0x001f0670 <+24>:    mov r0, r3                                  "move r3 into r0 as the return value"
   0x001f0674 <+28>:    add sp, r11, #0                             "restore sp"
   0x001f0678 <+32>:    pop {r11}       ; (ldr r11, [sp], #4)       "restore r11"
   0x001f067c <+36>:    bx  lr                                      "return"

If I take a look at the memory at the address p_msg, this is what it looks like:

0x69405de8: 0x002bcbf8  0x002b8774  0x00000000  0x69408200
0x69405df8: 0x69408200  0x5c5a5b1a  0x00000000  0x6940ed90
0x69405e08: 0x00000028  0x0000012d  0x694035f0  0x694007c8

So at offset 24 the value is actually 0x00000000, and that is what the method returns. But the correct value that should be returned is at offset 28 (0x6940ed90).

Is this a compiler problem? Or some 64-bit thing?

This is the compiler version btw: gcc version 8.3.0 (Raspbian 8.3.0-6+rpi1)


Solution

  • What could cause this problem?

    The most likely cause is that you've compiled your code with optimization, and are getting confused by what the debugger shows. Does the program proceed to report invalid response from reader, or does it actually continue to line 181?

    If the latter, see this answer.

    If the program really does go to execute line 179, then it is likely that your compiler has miscompiled your program (you'll need to disassemble the code to be sure).

    In that case, trying different compiler versions, disabling optimizations for a particular function or file, changing optimization levels, etc. may let you work around the compiler bug.

    Update:

    The program does report the invalid response from reader, so that code path is actually executed. I spent all afternoon investigating this again and at this point I believe it's a compiler error. In the disassembly I can see that it tries to load the value of m_pReaderEventNotificationData from the object address + 24 (ldr r3, [r3, #24]), but if I view the memory, the value at this offset is actually 0x00000000. The real value that it should return is at offset 28 instead of 24.

    This is actually a very common problem, usually stemming from an ODR violation or an incomplete rebuild.

    Suppose you have two object files, foo.o and bar.o, both compiled from sources that include this definition:

    const int NUM_X = 6;
    struct Bar {
      int m_x[NUM_X];
      void *m_p;
      void *Fn() { return m_p;}
    };
    

    Given the above, Fn() will return *(this + 24), that is, the pointer stored 24 bytes past the start of the object (6 ints of 4 bytes each), and this offset will be compiled into both object files.

    Now you change NUM_X from 6 to 7, and rebuild foo.o but not bar.o. Fn inside bar.o will still return *(this + 24), but it should return *(this + 28) (assuming a 32-bit binary).

    Similar behavior could happen if struct Bar is defined differently in foo.cc and bar.cc (ODR violation).

    Update 2:

    I deleted all traces of the library from the disk, recompiled the .a file, and reinstalled the library and the headers. I also tried to recompile the program when the lib was not present and got a linker error, so it's definitely not using another version of the lib that I don't know about. I also deleted the complete build of the program and fully recompiled it. But it's still the same behavior.

    You should verify that both files involved see the same definition of CREADER_EVENT_NOTIFICATION. It is best to capture the preprocessed files and compare the definition there (this is what the compiler actually sees). Be sure to use the exact compilation commands you used to build the library and the application.

    One sneaky way ODR violations can creep in is if the #defines in effect when building the library and the application are different. For example, consider this code:

    #ifdef NUM_XX
    const int NUM_X = NUM_XX;
    #else
    const int NUM_X = 6;
    #endif
    
    struct Bar {
      int m_x[NUM_X];
      void *m_p;
      void *Fn() { return m_p;}
    };
    

    Now compile foo.cc with -DNUM_XX=7 and bar.cc without it, and you've got an ODR violation.