c++templates compiler-optimization size-reduction

Optimize destructors size away

I'm building a code for an embedded system and I'm trying to save as much binary space as necessary.

The code is for parsing a protocol (MQTT for what it's worth), where there are numerous packets type and they are all different, but share some common parts.

Currently, to simplify writing the code, I'm using this pattern :

  template <PacketType type>
  struct ControlPacket
  {
      FixedHeader<type>    type;
      VariableHeader<type> header;
      Properties<type>     props;
      ... and so on...
  };   

  // Specialize for each type
  template <>
  struct FixedHeader<CONNECT>
  {
     uint8_t typeAndFlags;
     PacketType getType() const { return static_cast<PacketType>(typeAndFlags >> 4); }
     uint8 getFlags() const { return 0; }
     bool parseType(const uint8_t * buffer, int len) 
     { 
         if (len < 1) return false; 
         typeAndFlags = buffer[0]; 
         return true; 
     }
     ...  
  };

  template <>
  struct FixedHeader<PUBLISH>
  {
     uint8_t typeAndFlags;
     PacketType getType() const { return static_cast<PacketType>(typeAndFlags >> 4); }
     uint8 getFlags() const { return typeAndFlags & 0xF; }
     bool parseType(const uint8_t * buffer, int len) 
     { 
         if (len < 1) return false; 
         typeAndFlags = buffer[0];
         if (typeAndFlags & 0x1) return false;  // Example of per packet specific check to perform
         return true; 
     }
     ...  
  };

  ... For all packet types ...

This is working, and I'm now trying to reduce the binary impact of all those template specializations (else the code is almost duplicated 16 times)

So, I've came up to this paradigm:

   // Store the most common implementation in a base class
   struct FixedHeaderBase
   {
       uint8_t typeAndFlags;
       virtual PacketType getType() { return static_cast<PacketType(typeAndFlags >> 4); }
       virtual uint8 getFlags() { return 0; } // Most common code here
       virtual bool parseType(const uint8_t * buffer, int len) 
       { 
         if (len < 1) return false; 
         typeAndFlags = buffer[0]; 
         return true; 
       }

       virtual ~FixedHeaderBase() {}
   };

   // So that most class ends up empty
   template <>
   struct FixedHeader<CONNECT> final : public FixedHeaderBase
   {
   };

   // And specialize only the specific classes
   template <>
   struct FixedHeader<PUBLISH> final : public FixedHeaderBase
   {
       uint8 getFlags() const { return typeAndFlags & 0xF; }
       bool parseType(const uint8_t * buffer, int len) 
       { 
         if (!FixedHeaderBase::parseType(buffer, len)) return false; 
         if (typeAndFlags & 0x1) return false;  // Example of per packet specific check to perform
         return true; 
       }
   };

  // Most of the code is shared here
  struct ControlPacketBase
  {
     FixedHeaderBase & type;
     ...etc ...
     virtual bool parsePacket(const uint8_t * packet, int packetLen)
     {
        if (!type.parseType(packet, packetLen)) return false;
        ...etc ...
     }

     ControlPacketBase(FixedHeaderBase & type, etc...) : type(type) {} 
     virtual ~ControlPacketBase() {}
  };

  // This is only there to tell which specific version to use for the generic code
  template <PacketType type>
  struct ControlPacket final : public ControlPacketBase
  {
      FixedHeader<type>    type;
      VariableHeader<type> header;
      Properties<type>     props;
      ... and so on...

      ControlPacket() : ControlPacketBase(type, header, props, etc...) {}
  };

This is working quite well and allows to shave a lot of the used binary code space. By the way, I'm using final here so the compiler could devirtualize, and I'm compiling without RTTI (obviously also with -Os and each function in its own section that are garbage collected).

However, when I inspect the symbol table size, I'm finding a lot of duplication on the destructors (all template instances are implementing a destructor which is clearly the same (same binary size) or empty).

Typically, I understand that ControlPacket<CONNECT> needs to call ~FixedHeader<CONNECT>() and that ControlPacket<PUBLISH> needs to call ~FixedHeader<PUBLISH>() upon destruction.

Yet, since all the destructor are virtual, is there a way that the specialization of ControlPacket avoid their destructors and instead have ControlPacketBase to destruct them virtually so that I don't ends up with 16 useless destructors but only one ?

Solution

It's worth pointing out that this is related to an optimization called "identical COMDAT folding", or ICF. This is a linker feature where identical functions (i.e., empty functions) are all merged into one.

Not every linker supports this, and not every linker is willing to do this (because the language says that different functions require different address), but your toolchain could have this. It would be fast and easy.

I'm going to assume your problem is reproduced with this toy example:

#include <iostream>
#include <memory>
#include <variant>

extern unsigned nondet();

struct Base {
    virtual const char* what() const = 0;

    virtual ~Base() = default;
};

struct A final : Base {
    const char* what() const override {
        return "a";
    }
};

struct B final : Base {
    const char* what() const override {
        return "b";
    }
};

std::unique_ptr<Base> parse(unsigned v) {
    if (v == 0) {
        return std::make_unique<A>();
    } else if (v == 1) {
        return std::make_unique<B>();
    } else {
        __builtin_unreachable();
    }
}

const char* what(const Base& b) {
    return b.what();  // virtual dispatch
}

const char* what(const std::unique_ptr<Base>& b) {
    return what(*b);
}

int main() {
    unsigned v = nondet();
    auto packet = parse(v);

    std::cout << what(packet) << std::endl;
}

The disassembly shows that both A::~A and B::~B have (multiple) listings, even though they are empty and identical. This is with = default and final.

If one removes virtual, then these vacuous definitions go away and we achieve the goal - but now when the unique_ptr deletes the object, we invoke undefined behavior.

We have three choices for leaving the destructor non-virtual while maintaining well-defined behavior, two of which are useful and one is not.

Non-useful: the first option is to use shared_ptr. This works because shared_ptr actually type-erases its deleter function (see this question), so at no point does it delete via the base. In other words, when you make a shared_ptr<T>(u) for some u deriving from T, the shared_ptr stores a function pointer to U::~U directly.

However, this type erasure simply reintroduces the problem and generates even more empty virtual destructors. See the modified toy example to compare. I'm mentioning this for completeness, in case you happen to already be putting these into shared_ptr's on the side.

Useful: the alternative is to avoid virtual dispatch for lifetime management, and use a variant. It's not really proper to make such a blanket statement, but generally you can achieve smaller code and even some speedup with tag dispatch, as you avoid specifying vtables and dynamic allocation.

This requires the largest change in your code, because the object representing your packet must be interacted with in a different way (it is no longer an is-a relationship):

#include <iostream>

#include <boost/variant.hpp>

extern unsigned nondet();

struct Base {
    ~Base() = default;
};

struct A final : Base {
    const char* what() const {
        return "a";
    }
};

struct B final : Base {
    const char* what() const {
        return "b";
    }
};

typedef boost::variant<A, B> packet_t;

packet_t parse(unsigned v) {
    if (v == 0) {
        return A();
    } else if (v == 1) {
        return B();
    } else {
        __builtin_unreachable();
    }
}

const char* what(const packet_t& p) {
    return boost::apply_visitor([](const auto& v){
        return v.what();
    }, p);
}

int main() {
    unsigned v = nondet();
    auto packet = parse(v);

    std::cout << what(packet) << std::endl;
}

I used Boost.Variant because it produces the smallest code. Annoyingly, std::variant insists on producing some minor but present vtables for implementing itself - I feel like this defeats the purpose a bit, though even with the variant vtables the code remains much smaller overall.

I want to point out a nice result of modern optimizing compilers. Note the resulting implementation of what:

what(boost::variant<A, B> const&):
        mov     eax, DWORD PTR [rdi]
        cdq
        cmp     eax, edx
        mov     edx, OFFSET FLAT:.LC1
        mov     eax, OFFSET FLAT:.LC0
        cmove   rax, rdx
        ret

The compiler understands the closed set of options in the variant, the lambda duck-typing proved that each option really has a ...::what member function, and so it's really just picking out the string literal to return based on the variant value.

The trade-off with a variant is that you must have a closed set of options, and you no longer have a virtual interface enforcing certain functions exist. In return you get smaller code and the compiler can often see through the dispatch "wall".

However, if we define these simple visitor helper functions per "expected" member function, it acts as an interface checker - plus you've already got your helper class templates to keep things in line.

Finally, as an extension of the above: you are always free to maintain some virtual functions within the base class. This can offer the best of both worlds, if the cost of vtables is acceptable to you:

#include <iostream>

#include <boost/variant.hpp>

extern unsigned nondet();

struct Base {
    virtual const char* what() const = 0;

    ~Base() = default;
};

struct A final : Base {
    const char* what() const override {
        return "a";
    }
};

struct B final : Base {
    const char* what() const override {
        return "b";
    }
};

typedef boost::variant<A, B> packet_t;

packet_t parse(unsigned v) {
    if (v == 0) {
        return A();
    } else if (v == 1) {
        return B();
    } else {
        __builtin_unreachable();
    }
}

const Base& to_base(const packet_t& p) {
    return *boost::apply_visitor([](const auto& v){
        return static_cast<const Base*>(&v);
    }, p);
}

const char* what(const Base& b) {
    return b.what();  // virtual dispatch
}

const char* what(const packet_t& p) {
    return what(to_base(p));
}

int main() {
    unsigned v = nondet();
    auto packet = parse(v);

    std::cout << what(packet) << std::endl;
}

This produces fairly compact code.

What we have here is a virtual base class (but, no virtual destructor as it is not needed), and a to_base function that can take a variant and return for you the common base interface. (And in a hierarchy such as yours, you could have several of these per kind of base.)

From the common base you are free to perform virtual dispatch. This is sometimes easier to manage and faster depending on workload, and the additional freedom only costs some vtables. In this example, I've implemented what as first converting to the base class, and then perform virtual dispatch to the what member function.

Again, I want to point out the the definition of a visit, this time in to_base:

to_base(boost::variant<A, B> const&):
        lea     rax, [rdi+8]
        ret

The compiler understands the closed set of classes all inherit from Base, and so doesn't have to actually examine any variant type tag at all.

In the above I used Boost.Variant. Not everyone can or wants to use Boost, but the principles of the answer still apply: store an object and track what type of object is stored in an integer. When it's time to do something, peek at the integer and jump to the right place in code.

Implementing a variant is a whole different question. :)