c++ · network-programming · htonl

C++ NTOH conversion with dispatcher - event queue


We are rewriting our legacy C code in C++. At the core of our system we have a TCP client that is connected to a master. The master streams messages continuously, and each socket read yields some number N of messages of the format {type, size, data[0]}.
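To make the framing concrete, here is a rough sketch of that layout; the 16-bit field widths are an assumption on my part, only the {type, size, data[0]} shape is fixed:

#include <cstdint>

struct WireHeader {
    uint16_t type;   // message type, network byte order
    uint16_t size;   // total message length, network byte order
    // variable-length payload (the data[0] part) follows immediately
};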

Now we don't copy these messages into individual buffers; we just pass the pointer to the beginning of the message, the length, and a shared_ptr to the underlying buffer to the workers.

The legacy C version was single-threaded and would do an in-place ntoh conversion like below:

#include <stdint.h>
#include <arpa/inet.h>

struct Message {
   uint32_t something1;
   uint16_t something2;
};

void process(char *message)
{
    struct Message *m = (struct Message *)message;
    /* in-place conversion; hton and ntoh perform the same byte swap */
    m->something1 = htonl(m->something1);
    m->something2 = htons(m->something2);
}

And then use the Message.

There are a couple of issues with following the same logic in the new code.

  1. Since we are dispatching the messages to different workers, each worker doing an ntoh conversion will cause cache-miss issues, as the messages are not cache-aligned - i.e. there is no padding between the messages.

  2. The same message can be handled by different workers - this is the case where the message needs to be processed locally and also relayed to another process. Here the relay worker needs the message in the original network order and the local worker needs to convert it to host order. Obviously, since the message is not duplicated, both cannot be satisfied.

The solutions that come to my mind are:

  1. Duplicate the message and send one copy to all relay workers, if any. Do the ntoh conversion of all messages belonging to the same buffer in the dispatcher itself before dispatching - say by calling handler->ntoh(message); - so that the cache-miss issue is solved.

  2. Send each worker the original copy. Each worker will copy the message to a local buffer, do the ntoh conversion, and use it. Here each worker can use a thread-specific (thread_local) static buffer as a scratch pad to copy the message into (a sketch of this follows the list).
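A minimal sketch of what I mean by option 2; worker_handle and the buffer sizing are illustrative only:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#include <arpa/inet.h>

struct Message {
    uint32_t something1;
    uint16_t something2;
};

void worker_handle(const char *wire, std::size_t len)
{
    // Per-worker scratch pad, backed by max_align_t so it is suitably
    // aligned for any structure we cast it to.
    thread_local std::vector<std::max_align_t> scratch;
    scratch.resize((len + sizeof(std::max_align_t) - 1) / sizeof(std::max_align_t));
    std::memcpy(scratch.data(), wire, len);

    // Convert the copy in place; the original buffer stays in network order,
    // so a relay worker can still forward it untouched.
    auto *m = reinterpret_cast<Message *>(scratch.data());
    m->something1 = ntohl(m->something1);
    m->something2 = ntohs(m->something2);
    // ... use *m locally ...
}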

Now my questions are:

  1. Is the way option 1 does the ntoh conversion idiomatic C++? I mean, the alignment requirement of the structure will be different from that of the char buffer (we haven't had any issue with this yet). Using scheme 2 should be fine in this respect, as the scratch buffer can have the alignment of max_align_t and hence should be castable to any structure. But it incurs copying the entire message, which can be quite big (say a few KB).

  2. Is there a better way to handle the situation?


Solution

  • Your primary issue seems to be how to handle messages that come in misaligned. That is, if each message structure doesn't have enough padding at the end of it so that the following message is properly aligned, you can trigger misaligned reads by reinterpreting a pointer to the beginning of a message as a pointer to an object.

    We can get around this in a number of ways; perhaps the simplest is to ntoh based on single-byte reads, which are effectively always aligned.

    We can hide the nasty details behind wrapper classes, which will take a pointer to the start of a message and have accessors that will ntoh the appropriate field.

    As indicated in the comments, it's a requirement that offsets be determined by a C++ struct, since that's how the message is initially created, and it may not be packed.

    First, our ntoh implementation, templated so we can select one by type:

    #include <cstdint>

    // Primary template; specialized below for each unsigned width we need.
    template <typename R>
    struct ntoh_impl;
    
    template <>
    struct ntoh_impl<uint16_t>
    {
        static uint16_t ntoh(uint8_t const *d)
        {
            return (static_cast<uint16_t>(d[0]) << 8) |
                    d[1];
        }
    };
    
    template <>
    struct ntoh_impl<uint32_t>
    {
        static uint32_t ntoh(uint8_t const *d)
        {
            return (static_cast<uint32_t>(d[0]) << 24) |
                   (static_cast<uint32_t>(d[1]) << 16) |
                   (static_cast<uint32_t>(d[2]) <<  8) |
                   d[3];
        }
    };
    
    template<>
    struct ntoh_impl<uint64_t>
    {
        static uint64_t ntoh(uint8_t const *d)
        {
            return (static_cast<uint64_t>(d[0]) << 56) |
                   (static_cast<uint64_t>(d[1]) << 48) |
                   (static_cast<uint64_t>(d[2]) << 40) |
                   (static_cast<uint64_t>(d[3]) << 32) |
                   (static_cast<uint64_t>(d[4]) << 24) |
                   (static_cast<uint64_t>(d[5]) << 16) |
                   (static_cast<uint64_t>(d[6]) <<  8) |
                   d[7];
        }
    };
    

    Now we'll define a set of nasty macros that will automatically implement accessors for a given name by looking up the member with the matching name in the struct proto (a private struct to each class):

    #include <cstddef>      // offsetof
    #include <type_traits>  // std::decay
    #include <utility>      // std::declval

    #define MEMBER_TYPE(MEMBER) typename std::decay<decltype(std::declval<proto>().MEMBER)>::type

    #define IMPL_GETTER(MEMBER) MEMBER_TYPE(MEMBER) MEMBER() const { return ntoh_impl<MEMBER_TYPE(MEMBER)>::ntoh(data + offsetof(proto, MEMBER)); }
    

    Finally, we have an example implementation of the message structure you have given:

    class Message
    {
    private:
        struct proto
        {
            uint32_t something1;
            uint16_t something2;
        };
    
    public:
        explicit Message(uint8_t const *p) : data(p) {}
        explicit Message(char const *p) : data(reinterpret_cast<uint8_t const *>(p)) {}
    
        IMPL_GETTER(something1)
        IMPL_GETTER(something2)
    
    private:
        uint8_t const *data;
    };
    

    Now Message::something1() and Message::something2() are implemented and will read from the data pointer at the same offsets they wind up being in Message::proto.
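    For instance, once the decltype/decay machinery resolves to uint32_t, IMPL_GETTER(something1) expands to roughly the following accessor:

    uint32_t something1() const
    {
        return ntoh_impl<uint32_t>::ntoh(data + offsetof(proto, something1));
    }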

    Providing the implementation in the header (effectively inline) has the potential to inline the entire ntoh sequence at the call site of each accessor!
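    A hypothetical call site, assuming the Message class above is in scope and msg points at the start of a message inside the shared receive buffer:

    void handle(char const *msg)
    {
        Message view(msg);                 // non-owning view, no copy
        uint32_t s1 = view.something1();   // byte-wise read, converted to host order
        uint16_t s2 = view.something2();
        // the underlying buffer is untouched, so relay workers can forward
        // it in network byte order as-is
    }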

    This class does not own the data allocation it is constructed from. Presumably you could write a base or wrapper class if there are ownership details to maintain here.
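    One possible shape for that, as a hedged sketch rather than part of the answer proper: an owning variant that derives from Message and keeps the shared_ptr to the receive buffer alive alongside the view. OwnedMessage and the shared_ptr<const char[]> element type are assumptions chosen to match the buffer handling described in the question (C++17 for the array shared_ptr):

    #include <cstddef>
    #include <memory>

    class OwnedMessage : public Message
    {
    public:
        OwnedMessage(std::shared_ptr<const char[]> buffer, std::size_t offset)
            : Message(buffer.get() + offset), buffer_(std::move(buffer)) {}

    private:
        std::shared_ptr<const char[]> buffer_;  // keeps the underlying allocation alive
    };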