We decided to expose one of our IPC (Inter Process Communication) modules written in C++ to Python (I know, it's not the brightest idea). We use data packets that can be serialized and deserialized to/from std::string (the behavior is similar to Protocol Buffers, just not as efficient), so our IPC class returns and accepts std::string as well.
The problem with exposing that class to Python is that the C++ std::string type is converted to the Python str type, and when the returned std::string contains bytes that cannot be decoded as UTF-8 (which is most of the time) I get a UnicodeDecodeError exception.
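To illustrate the failure mode in isolation (assuming a little-endian machine, where the serialized int 999 starts with the byte 0xe7):
# 0xe7 opens a three-byte UTF-8 sequence, but 0x03 is not a valid
# continuation byte, so decoding the packet bytes as text fails
b'\xe7\x03\x00\x00'.decode('utf-8')  # raises UnicodeDecodeError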
I managed to find two workarounds (or even "solutions"?) for this problem, but I am not particularly happy with any of them.
This is my C++ code that reproduces the UnicodeDecodeError problem and implements both workarounds:
/*
 * boost::python string problem
 */
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <boost/python.hpp>
#include <boost/python/suite/indexing/vector_indexing_suite.hpp>

struct Packet {
    std::string serialize() const {
        char buff[sizeof(x_) + sizeof(y_)];
        std::memcpy(buff, &x_, sizeof(x_));
        std::memcpy(buff + sizeof(x_), &y_, sizeof(y_));
        return std::string(buff, sizeof(buff));
    }

    bool deserialize(const std::string& buff) {
        if (buff.size() != sizeof(x_) + sizeof(y_)) {
            return false;
        }
        std::memcpy(&x_, buff.c_str(), sizeof(x_));
        std::memcpy(&y_, buff.c_str() + sizeof(x_), sizeof(y_));
        return true;
    }

    // whatever ...
    int x_;
    float y_;
};

class CommunicationPoint {
public:
    // original version: returns the serialized packet as std::string
    std::string read() {
        // in my production code I read that std::string from the other communication point of course
        Packet p;
        p.x_ = 999;
        p.y_ = 1234.5678;
        return p.serialize();
    }

    // first workaround: copy the bytes into a std::vector<uint8_t>
    std::vector<uint8_t> readV2() {
        Packet p;
        p.x_ = 999;
        p.y_ = 1234.5678;
        std::string buff = p.serialize();
        std::vector<uint8_t> result;
        std::copy(buff.begin(), buff.end(), std::back_inserter(result));
        return result;
    }

    // second workaround: expose the bytes through a Python memoryview
    boost::python::object readV3() {
        Packet p;
        p.x_ = 999;
        p.y_ = 1234.5678;
        std::string serialized = p.serialize();
        char* buff = new char[serialized.size()]; // here valgrind detects a leak
        std::copy(serialized.begin(), serialized.end(), buff);
        PyObject* py_buf = PyMemoryView_FromMemory(
            buff, serialized.size(), PyBUF_READ);
        auto retval = boost::python::object(boost::python::handle<>(py_buf));
        //delete[] buff; // if I execute delete[] I get garbage in python
        return retval;
    }
};

BOOST_PYTHON_MODULE(UtfProblem) {
    boost::python::class_<std::vector<uint8_t> >("UintVec")
        .def(boost::python::vector_indexing_suite<std::vector<uint8_t> >());

    boost::python::class_<CommunicationPoint>("CommunicationPoint")
        .def("read", &CommunicationPoint::read)
        .def("readV2", &CommunicationPoint::readV2)
        .def("readV3", &CommunicationPoint::readV3);
}
It can be compiled with g++ -g -fPIC -shared -o UtfProblem.so -lboost_python-py35 -I/usr/include/python3.5m/ UtfProblem.cpp
(in production we use CMake of course).
This is a short python script that loads my library and decodes the numbers:
import UtfProblem
import struct
cp = UtfProblem.CommunicationPoint()
# cp.read()  # raises UnicodeDecodeError
result = cp.readV2()
# result is UintVec type, so I need to convert it to bytes first
intVal = struct.unpack('i', bytes([x for x in result[0:4]]))
floatVal = struct.unpack('f', bytes([x for x in result[4:8]]))
print('intVal: {} floatVal: {}'.format(intVal, floatVal))
result = cp.readV3().tobytes()
intVal = struct.unpack('i', result[0:4])
floatVal = struct.unpack('f', result[4:8])
print('intVal: {} floatVal: {}'.format(intVal, floatVal))
In the first workaround, instead of returning std::string I return std::vector<uint8_t>. It works OK, but I don't like that it forces me to expose an additional, artificial Python type UintVec that has no native support for conversion to Python bytes.
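For example, to get a bytes object I have to rebuild it element by element (a sketch, assuming the packed int-plus-float payload matches the native 'if' struct format):
result = cp.readV2()
# UintVec has no buffer protocol, so every byte is copied through a Python int
raw = bytes(x for x in result)
intVal, floatVal = struct.unpack('if', raw)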
The second workaround is nicer, because it exposes my serialized packet as a block of memory with native support for conversion to bytes, but it leaks memory. I verified the leak using valgrind: valgrind --suppressions=../valgrind-python.supp --leak-check=yes -v --log-file=valgrindLog.valgrind python3 UtfProblem.py
and apart from a lot of invalid reads from the Python library (probably false positives) it reports
8 bytes in 1 blocks are definitely lost
at the line where I allocate the memory for my buffer. If I delete[] the buffer before returning from the function I get garbage in Python, because PyMemoryView_FromMemory does not copy the data or take ownership of it.
How can I appropriately expose my serialized data to Python? In C++ we usually represent an array of bytes as std::string or const char*, which unfortunately doesn't port to Python in a nice way.
If my second workaround seems OK to you, how can I avoid the memory leak?
If exposing the return value as std::string is OK in general, how can I avoid the UnicodeDecodeError?
Additional info:
According to AntiMatterDynamite's comment, returning a Python bytes object (built with the Python C API) works perfectly fine:
PyObject* read() {
    Packet p;
    p.x_ = 999;
    p.y_ = 1234.5678;
    std::string buff = p.serialize();
    // PyBytes_FromStringAndSize copies the buffer into a new Python bytes object
    return PyBytes_FromStringAndSize(buff.c_str(), buff.size());
}
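On the Python side the returned value is then a plain bytes object, so it can be unpacked directly (a sketch, assuming the same native int-plus-float layout as in the script above):
result = cp.read()  # bytes, no intermediate conversion needed
intVal, floatVal = struct.unpack('if', result)
print('intVal: {} floatVal: {}'.format(intVal, floatVal))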