Tags: c++, pointers, multiple-inheritance, virtual-functions, this-pointer

Compiler details of the this pointer, virtual functions and multiple inheritance


I'm reading Bjarne's paper: Multiple Inheritance for C++.

In section 3, page 370, Bjarne said that "The compiler turns a call of a member function into an "ordinary" function call with an "extra" argument; that "extra" argument is a pointer to the object for which the member function is called."
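As a rough illustration of the quoted sentence, the sketch below puts a member function next to a hand-written free function that takes the object pointer explicitly. `Counter`, `Counter_add` and `same_effect` are made-up names for this example; a real compiler emits a mangled name and does this rewriting internally.

```cpp
#include <cassert>

// Hypothetical class used only for illustration.
struct Counter {
    int total = 0;
    // Ordinary member function: `this` is passed implicitly.
    int add(int n) { total += n; return total; }
};

// Roughly what the member function becomes: a free function whose first
// parameter plays the role of `this`. The name is invented for this sketch.
int Counter_add(Counter* self, int n) {
    assert(self != nullptr);
    self->total += n;
    return self->total;
}

// Returns true when both call forms leave their objects in the same state.
bool same_effect() {
    Counter x, y;
    int r1 = x.add(2);            // member call, implicit `this`
    int r2 = Counter_add(&y, 2);  // "ordinary" call, explicit object pointer
    return r1 == r2 && x.total == y.total;
}
```

Calling `same_effect()` from a `main` of your own compares the two call forms.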

I'm confused by the extra this argument. Please see the following two examples:

Example 1:(page 372)

class A {
    int a;
    virtual void f(int);
    virtual void g(int);
    virtual void h(int);
};
class B : A {int b; void g(int); };
class C : B {int c; void h(int); };

An object of class C looks like:

C:

-----------                vtbl:
+0:  vptr -------------->  -----------
+4:  a                     +0: A::f
+8:  b                     +4: B::g
+12: c                     +8: C::h
-----------                -----------  

A call to a virtual function is transformed into an indirect call by the compiler. For example,

C* pc;
pc->g(2)

becomes something like:

(*(pc->vptr[1]))(pc, 2)

That is the conclusion I take from Bjarne's paper. The this pointer being passed is a C*.
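The indirect call can be imitated by hand with a table of function pointers. This is only a sketch of the layout drawn in Example 1: `Obj`, `Slot`, `vtbl_C` and `call_g` are invented names, and the functions return `int` (instead of `void` as in the paper) so that the effect of the call is visible.

```cpp
#include <cassert>

// Hand-written imitation of the Example 1 layout. A real compiler
// generates the table and the hidden vptr member itself.
struct Obj;                        // stands in for an object of class C
using Slot = int (*)(Obj*, int);   // every slot takes the object pointer first

struct Obj {
    const Slot* vptr;   // first member: pointer to the table
    int a, b, c;        // data members contributed by A, B and C
};

// Stand-ins for A::f, B::g and C::h.
int A_f(Obj* self, int x) { return self->a + x; }
int B_g(Obj* self, int x) { return self->b + x; }
int C_h(Obj* self, int x) { return self->c + x; }

// vtbl for C: slot 0 = A::f, slot 1 = B::g, slot 2 = C::h.
const Slot vtbl_C[] = { A_f, B_g, C_h };

// pc->g(2) becomes (*(pc->vptr[1]))(pc, 2).
int call_g(Obj* pc, int x) {
    assert(pc != nullptr && pc->vptr != nullptr);
    return (*(pc->vptr[1]))(pc, x);
}
```

Because the chain is linear, the same `Obj*` is a valid object pointer for all three stand-in functions, which is why no adjustment appears in the call.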

In the following example, Bjarne told another story which totally confused me!


Example 2:(page 373)

Given two classes

class A {...};
class B {...};
class C: A, B {...};

An object of class C can be laid out as a contiguous object like this:

pc-->          ----------- 
                  A part
B:bf's this--> -----------  
                  B part
               ----------- 
                  C part
               -----------

Calling a member function of B given a C*:

C* pc;
pc->bf(2); //assume that bf is a member of B and that C has no member named bf.

Bjarne wrote: "Naturally, B::bf() expects a B* (to become its this pointer)." The compiler transforms the call into:

bf__F1B((B*)((char*)pc+delta(B)), 2);

Why do we need a B* as the this pointer here? If we just passed a C* as this, I think we could still access the members of B correctly. For example, to reach a member of class B inside B::bf(), we would only need something like *(this+offset), and the compiler knows this offset. Is this right?


Follow-up questions for examples 1 and 2:

(1) With a linear chain of derivation (example 1), why can the C object be expected to be at the same address as its B and, in turn, A sub-objects? Is there really no problem using a C* pointer to access class B's members inside the function B::g in example 1? For example, if we want to access the member b, what happens at run time? *(pc+8)?

(2) Why can't we use the same memory layout (linear chain derivation) for multiple inheritance? Assume that in example 2, classes A, B and C have exactly the same members as in example 1. A: int a and f; B: int b and bf (or call it g); C: int c and h. Why not just use a memory layout like:

-----------
+0:  a
+4:  b
+8:  c
-----------

(3) I've written some simple code to test the difference between linear chain derivation and multiple inheritance.

class A {...};
class B : A {...};
class C: B {...};
C* pc = new C();
B* pb = NULL;
pb = (B*)pc;
A* pa = NULL;
pa = (A*)pc;
cout << pc << " " << pb << " " << pa << endl;

It shows that pa, pb and pc have the same address.

class A {...};
class B {...};
class C: A, B {...};
C* pc = new C();
B* pb = NULL;
pb = (B*)pc;
A* pa = NULL;
pa = (A*)pc;

Now pc and pa have the same address, while pb is at some offset from pa and pc.

Why does the compiler make this difference?


Example 3:(page 377)

class A {virtual void f();};
class B {virtual void f(); virtual void g();};
class C: A, B {void f();};
A* pa = new C;
B* pb = new C;
C* pc = new C;
pa->f();
pb->f();
pc->f();
pc->g();

(1) The first question is about pc->g(), which relates to the discussion in example 2. Does the compiler do the following transformation:

pc->g() ==> g__F1B((B*)((char*)pc+delta(B)))

Or do we have to wait until run time to do this?

(2) Bjarne wrote: On entry to C::f, the this pointer must point to the beginning of the C object (and not to the B part). However, it is not in general known at compile time that the B pointed to by pb is part of a C so the compiler cannot subtract the constant delta(B).

Why can't we know at compile time that the B object pointed to by pb is part of a C? As I understand it, with B* pb = new C, pb points to a newly created C object, and C inherits from B, so the B pointer pb points to part of a C.

(3) Assume that we do not know at compile time that the B pointed to by pb is part of a C. Then delta(B) has to be kept for run time, and it is in fact stored in the vtbl. So a vtbl entry now looks like:

struct vtbl_entry {
    void (*fct)();
    int  delta;
};

Bjarne wrote:

pb->f() // call of C::f:
register vtbl_entry* vt = &pb->vtbl[index(f)];
(*vt->fct)((B*)((char*)pb+vt->delta)) //vt->delta is a negative number I guess

I'm totally confused here. Why is it (B*) and not (C*) in (*vt->fct)((B*)((char*)pb+vt->delta))? Based on my understanding, and on Bjarne's first sentence of section 5.1 on page 377, we should pass a C* as this here!

After the above code snippet, Bjarne continues: Note that the object pointer may have to be adjusted to point to the correct sub-object before looking for the member pointing to the vtbl.

I have no idea what Bjarne is trying to say here. Can you help explain it?
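For reference, the vtbl_entry scheme quoted above can be modelled by hand. The sketch below is an assumption-laden imitation, not what any particular compiler emits (many use small adjusting thunks instead): `BPart`, `CObj`, `B_f`, `C_f`, `vtbl_B`, `vtbl_B_in_C` and `call_f` are invented names, and the A part is reduced to a single int. The caller only has a pointer to a B part and cannot tell which table it will find, so it reads the delta at run time.

```cpp
#include <cassert>
#include <cstddef>

// One table slot: the function to call and the byte adjustment for `this`.
struct vtbl_entry {
    int (*fct)(void* self);
    std::ptrdiff_t delta;
};

struct BPart { const vtbl_entry* vtbl; int b; };   // a B (sub-)object
struct CObj  { int a; BPart bpart; int c; };       // A part, B part, C part

// Stand-in for B::f: expects `self` to point at a BPart.
int B_f(void* self) { return static_cast<BPart*>(self)->b; }
// Stand-in for C::f: expects `self` to point at the start of the CObj.
int C_f(void* self) { return static_cast<CObj*>(self)->c; }

// Table used by a stand-alone B: no adjustment needed.
const vtbl_entry vtbl_B[] = { { B_f, 0 } };
// Table used by the B part of a C: step back to the start of the C object.
const vtbl_entry vtbl_B_in_C[] = {
    { C_f, -static_cast<std::ptrdiff_t>(offsetof(CObj, bpart)) }
};

// pb->f(): look up the entry, add its delta to the pointer, then call.
int call_f(BPart* pb) {
    assert(pb != nullptr && pb->vtbl != nullptr);
    const vtbl_entry* vt = &pb->vtbl[0];
    return (*vt->fct)(reinterpret_cast<char*>(pb) + vt->delta);
}
```

In this model the delta stored for the B part of a C is negative, and the adjusted pointer handed to `C_f` is the start of the whole object, while the calling code is identical for both kinds of object.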


Solution

  • Bjarne wrote: "Naturally, B::bf() expects a B* (to become its this pointer)." The compiler transforms the call into:

    bf__F1B((B*)((char*)pc+delta(B)), 2);
    

    Why do we need a B* as the this pointer here?

    Consider B in isolation: the compiler needs to be able to compile code à la B::bf(B* this). It doesn't know what classes might later be derived from B (and derived code might not be introduced until long after B::bf is compiled). The code for B::bf won't magically know how to transform a pointer of some other type (e.g. C*) into a B* it can use to access data members and runtime type information (the virtual dispatch table, typeinfo).

    Instead, the caller has the responsibility of extracting a valid B* to the B sub-object in whatever actual runtime type is involved (e.g. C). In this case, the C* holds the address of the start of the overall C object which likely matches the address of the A sub-object, and the B sub-object is some fixed but non-0 offset further into memory: it's that offset (in bytes) that must be added to the C* in order to get a valid B* with which to call B::bf - that adjustment is done when the pointer is cast from C* type to B* type.
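A minimal sketch of that caller-side adjustment, using invented members and helper names (`delta_B`, `bf_explicit`, `call_bf`). The offset is measured at run time here purely for illustration; the compiler knows it statically, and its exact value depends on the ABI.

```cpp
#include <cassert>
#include <cstddef>

struct A { int a = 1; };
struct B { int b = 2; int bf(int x) { return b + x; } };
struct C : A, B { int c = 3; };

// delta(B): byte distance from the start of a C to its B sub-object.
std::ptrdiff_t delta_B(C* pc) {
    B* pb = pc;   // the implicit C* -> B* conversion applies the adjustment
    return reinterpret_cast<char*>(pb) - reinterpret_cast<char*>(pc);
}

// Stand-in for the compiled B::bf with `this` made explicit.
int bf_explicit(B* self, int x) { return self->b + x; }

// What the caller does for pc->bf(2): adjust the pointer, then call.
int call_bf(C* pc, int x) {
    assert(pc != nullptr);
    B* adjusted = static_cast<B*>(pc);
    return bf_explicit(adjusted, x);
}
```

`bf_explicit` never mentions C: everything it needs to know about the layout is that b sits at a fixed offset from the B* it receives.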

    (1) With a linear chain of derivation (example 1), why can the C object be expected to be at the same address as its B and, in turn, A sub-objects? Is there really no problem using a C* pointer to access class B's members inside the function B::g in example 1? For example, if we want to access the member b, what happens at run time? *(pc+8)?

    Linear derivation B : A and C : B can be thought of as successively tacking B-specific fields on the end of A, then C-specific fields on the end of B (which is still B-specific fields tacked on the end of A). So the whole thing looks like:

    [[[A fields...]B-specific-fields....]C-specific-fields...]
     ^
     |--- A, B & C all start at the same address
    

    Then, when we talk about a "B" we're talking about all the embedded A fields as well as the additions, and for "C" there's still all the A and B fields: they all start at the same address.
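A small check of this, assuming the usual layout for non-virtual single inheritance (the C++ standard does not strictly promise it, but mainstream compilers behave this way). The class members and the helper `same_start` are made up for the example.

```cpp
#include <cassert>

struct A { int a = 1; };
struct B : A { int b = 2; };
struct C : B { int c = 3; };

// True when the C object and its B and A sub-objects share one address.
bool same_start(C* pc) {
    assert(pc != nullptr);
    B* pb = pc;   // no adjustment expected for a linear chain
    A* pa = pc;
    return static_cast<void*>(pc) == static_cast<void*>(pb) &&
           static_cast<void*>(pb) == static_cast<void*>(pa);
}
```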

    Regarding *(pc+8) - that's right (given the understanding that we're adding 8 bytes to the address, and not the usual C++ behaviour of adding multiples of the pointee's size).
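To make the byte-versus-element distinction concrete, here is a sketch that measures the offset of b and reads it back by stepping that many bytes. `offset_of_b` and `read_b_by_bytes` are hypothetical helpers; the classes here have no vptr, so the measured offset will usually not be the 8 from the paper's diagram.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct A { int a = 1; };
struct B : A { int b = 2; };
struct C : B { int c = 3; };

// Byte offset of member b inside this C object, measured rather than assumed.
std::ptrdiff_t offset_of_b(const C* pc) {
    return reinterpret_cast<const char*>(&pc->b) -
           reinterpret_cast<const char*>(pc);
}

// Read b by stepping a number of BYTES from the start of the object.
// Writing pc + n instead would step n * sizeof(C) bytes.
int read_b_by_bytes(const C* pc) {
    assert(pc != nullptr);
    const char* base = reinterpret_cast<const char*>(pc);
    int value;
    std::memcpy(&value, base + offset_of_b(pc), sizeof value);
    return value;
}
```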

    (2) Why can't we use the same memory layout (linear chain derivation) for multiple inheritance? Assume that in example 2, classes A, B and C have exactly the same members as in example 1. A: int a and f; B: int b and bf (or call it g); C: int c and h. Why not just use a memory layout like:

    -----------
    +0:  a
    +4:  b
    +8:  c
    -----------
    

    No reason - that's exactly what happens... the same memory layout. The difference is that the B subobject doesn't consider A to be a part of itself. It's now like this:

    [[A fields...][B fields....]C-specific-fields...]
     ^             ^
     \ A&C start   \ B starts
    

    So when you call B::bf it wants to know where the B object starts - the this pointer you provide should be at "+4" in the above list; if you call B::bf using a C* then the compiler-generated calling code will need to add that 4 to form the implicit this parameter to B::bf(). B::bf() can't simply be told where A or C start at +0: B::bf() knows nothing about either of those classes and doesn't know how to reach b or its RTTI if you give it a pointer to anything other than its own +4 address.
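One way to observe this from inside B is to have a member report the this it received. In the sketch below, `self()` and `this_is_b_subobject` are invented for illustration; the same B code runs unchanged on a stand-alone B and on the B part of a C.

```cpp
#include <cassert>

struct A { int a = 1; };
struct B {
    int b = 2;
    // Compiled knowing only B: reads b at a fixed offset from `this`.
    int bf(int x) { return b + x; }
    // Reports the `this` it received, so the caller's adjustment is visible.
    const void* self() const { return this; }
};
struct C : A, B { int c = 3; };

// True when a B member called through a C* received the address of the
// B sub-object, i.e. the caller adjusted the pointer before the call.
bool this_is_b_subobject(C* pc) {
    assert(pc != nullptr);
    const void* seen = pc->self();
    return seen == static_cast<const void*>(static_cast<B*>(pc));
}
```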