We decided to expose one of our IPC (Inter Process Communication) modules written in C++ to Python (I know, it's not the brightest idea). We use data packets that can be serialized and deserialized to/from std::string (the behavior is similar to Protocol Buffers, just not as efficient), so our IPC class returns and accepts std::string as well.
The problem with exposing that class to Python is that the C++ std::string type is converted to the Python str type, and when the returned std::string contains bytes that cannot be decoded as UTF-8 (which is most of the time) I get a UnicodeDecodeError exception.
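To illustrate the failure mode in isolation (assuming a little-endian machine, where the serialized int 999 starts with the byte 0xe7):
# 0xe7 opens a three-byte UTF-8 sequence, but 0x03 is not a valid
# continuation byte, so decoding the packet bytes as text fails
b'\xe7\x03\x00\x00'.decode('utf-8')  # raises UnicodeDecodeError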
I managed to find two workarounds (or even "solutions"?) for this problem, but I am not particularly happy with any of them.
This is my C++ code that reproduces the UnicodeDecodeError problem and implements both workarounds:
/*
 * boost::python string problem
 */
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <boost/python.hpp>
#include <boost/python/suite/indexing/vector_indexing_suite.hpp>

struct Packet {
    std::string serialize() const {
        char buff[sizeof(x_) + sizeof(y_)];
        std::memcpy(buff, &x_, sizeof(x_));
        std::memcpy(buff + sizeof(x_), &y_, sizeof(y_));
        return std::string(buff, sizeof(buff));
    }

    bool deserialize(const std::string& buff) {
        if (buff.size() != sizeof(x_) + sizeof(y_)) {
            return false;
        }
        std::memcpy(&x_, buff.c_str(), sizeof(x_));
        std::memcpy(&y_, buff.c_str() + sizeof(x_), sizeof(y_));
        return true;
    }

    // whatever ...
    int x_;
    float y_;
};

class CommunicationPoint {
public:
    // original version: returns the serialized packet as std::string
    std::string read() {
        // in my production code I read that std::string from the other communication point of course
        Packet p;
        p.x_ = 999;
        p.y_ = 1234.5678;
        return p.serialize();
    }

    // first workaround: copy the bytes into a std::vector<uint8_t>
    std::vector<uint8_t> readV2() {
        Packet p;
        p.x_ = 999;
        p.y_ = 1234.5678;
        std::string buff = p.serialize();
        std::vector<uint8_t> result;
        std::copy(buff.begin(), buff.end(), std::back_inserter(result));
        return result;
    }

    // second workaround: expose the bytes through a Python memoryview
    boost::python::object readV3() {
        Packet p;
        p.x_ = 999;
        p.y_ = 1234.5678;
        std::string serialized = p.serialize();
        char* buff = new char[serialized.size()]; // here valgrind detects a leak
        std::copy(serialized.begin(), serialized.end(), buff);
        PyObject* py_buf = PyMemoryView_FromMemory(
            buff, serialized.size(), PyBUF_READ);
        auto retval = boost::python::object(boost::python::handle<>(py_buf));
        //delete[] buff; // if I execute delete[] I get garbage in python
        return retval;
    }
};

BOOST_PYTHON_MODULE(UtfProblem) {
    boost::python::class_<std::vector<uint8_t> >("UintVec")
        .def(boost::python::vector_indexing_suite<std::vector<uint8_t> >());

    boost::python::class_<CommunicationPoint>("CommunicationPoint")
        .def("read", &CommunicationPoint::read)
        .def("readV2", &CommunicationPoint::readV2)
        .def("readV3", &CommunicationPoint::readV3);
}
It can be compiled with g++ -g -fPIC -shared -o UtfProblem.so -lboost_python-py35 -I/usr/include/python3.5m/ UtfProblem.cpp
(in production we use CMake of course).
This is a short python script that loads my library and decodes the numbers:
import UtfProblem
import struct
cp = UtfProblem.CommunicationPoint()
# cp.read()  # raises UnicodeDecodeError
result = cp.readV2()
# result is UintVec type, so I need to convert it to bytes first
intVal = struct.unpack('i', bytes([x for x in result[0:4]]))
floatVal = struct.unpack('f', bytes([x for x in result[4:8]]))
print('intVal: {} floatVal: {}'.format(intVal, floatVal))
result = cp.readV3().tobytes()
intVal = struct.unpack('i', result[0:4])
floatVal = struct.unpack('f', result[4:8])
print('intVal: {} floatVal: {}'.format(intVal, floatVal))
In the first workaround, instead of returning std::string I return std::vector<uint8_t>. It works OK, but I don't like that it forces me to expose an additional, artificial Python type UintVec that has no native support for conversion to Python bytes.
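For example, to get a bytes object I have to rebuild it element by element (a sketch, assuming the packed int-plus-float payload matches the native 'if' struct format):
result = cp.readV2()
# UintVec has no buffer protocol, so every byte is copied through a Python int
raw = bytes(x for x in result)
intVal, floatVal = struct.unpack('if', raw)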
The second workaround is nicer, because it exposes my serialized packet as a block of memory with native support for conversion to bytes, but it leaks memory. I verified the leak using valgrind: valgrind --suppressions=../valgrind-python.supp --leak-check=yes -v --log-file=valgrindLog.valgrind python3 UtfProblem.py
and apart from a lot of invalid reads from the Python library (probably false positives) it reports
8 bytes in 1 blocks are definitely lost
at the line where I allocate the memory for my buffer. If I delete[] the buffer before returning from the function I get garbage in Python, because PyMemoryView_FromMemory does not copy the data or take ownership of it.
How can I appropriately expose my serialized data to Python? In C++ we usually represent an array of bytes as std::string or const char*, which unfortunately doesn't port to Python in a nice way.
If my second workaround seems OK to you, how can I avoid the memory leak?
If exposing the return value as std::string is OK in general, how can I avoid the UnicodeDecodeError?
Additional info:
According to AntiMatterDynamite's comment, returning a Python bytes object (built with the Python C API) works perfectly fine:
PyObject* read() {
    Packet p;
    p.x_ = 999;
    p.y_ = 1234.5678;
    std::string buff = p.serialize();
    // PyBytes_FromStringAndSize copies the buffer into a new Python bytes object
    return PyBytes_FromStringAndSize(buff.c_str(), buff.size());
}
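On the Python side the returned value is then a plain bytes object, so it can be unpacked directly (a sketch, assuming the same native int-plus-float layout as in the script above):
result = cp.read()  # bytes, no intermediate conversion needed
intVal, floatVal = struct.unpack('if', result)
print('intVal: {} floatVal: {}'.format(intVal, floatVal))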