I am trying to run Python embedded within a simple C program. However, when I import a module, I get the error undefined symbol: PyUnicodeUCS2_DecodeUTF8.
Upon further investigation, I discovered that the Python interpreter started by Py_Initialize(); uses UCS-4 encoding, whereas the module I am trying to import uses UCS-2 encoding. Is there a way to initialize the Python interpreter with the correct encoding? I am using a CentOS 7 Linux system, which mostly uses UCS-2, and I don't know why the embedded interpreter is using UCS-4.
C code: embed.c
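#include <Python.h>

int main(int argc, char *argv[])
{
    PyObject *pName, *pModule;
    if (argc < 2) return 1;               /* expect the module name as argv[1] */
    Py_Initialize();
    pName = PyString_FromString(argv[1]); /* name of the module to import */
    pModule = PyImport_Import(pName);
    Py_DECREF(pName);
    if (pModule == NULL) PyErr_Print();   /* report why the import failed */
    Py_XDECREF(pModule);
    Py_Finalize();
    return 0;
}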
Python
import sys
# How I printed the interpreter's Unicode width; sys.maxunicode is 1114111 here
print(__file__ + ": Encoding: " + str(sys.maxunicode))
import torch
Makefile
gcc -I /usr/include/python2.7 embed.c -o embed -lpython2.7
The code compiles, but when I run it I get this error message: undefined symbol: PyUnicodeUCS2_DecodeUTF8.
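There is no way to initialize the interpreter with a different encoding. Whether the interpreter uses UCS-2 or UCS-4 is a compile-time choice, and an extension module only loads into an interpreter built with the same choice. What you need to do is recompile the entire module from source against your UCS-4 interpreter. If you do not have the sources for the module, then you must instead compile Python 2.7 itself from source as a UCS-2 build (the --enable-unicode=ucs2 configure option) and build embed.c against it, being careful not to replace the system Python 2.7 with it.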
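To confirm which kind of build a given interpreter is, the sys.maxunicode value printed in the question is enough. A minimal sketch follows; the helper name unicode_build is made up for illustration and is not part of any library.

```python
import sys

def unicode_build(maxunicode=None):
    """Classify an interpreter build from its sys.maxunicode value."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0xFFFF:
        return "UCS2"   # narrow build (on Python 2: PyUnicodeUCS2_* symbols)
    if maxunicode == 0x10FFFF:
        return "UCS4"   # wide build (on Python 2: PyUnicodeUCS4_* symbols)
    return "unknown"

# Report the build of the interpreter running this script.
print(unicode_build())
```

Running this both in the embedded interpreter and in the interpreter the module was built for should show whether the two disagree.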
The UCS-2 builds were considered a mistake because non-BMP characters are represented there as UTF-16 surrogate pairs, and the two halves of each pair become visible as separate code points. That is why Python 3 (since version 3.3) no longer has this compile-time option: it always uses UCS-4 internally for strings that cannot be represented in UCS-2.
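The surrogate-pair issue can be illustrated by encoding a non-BMP character as UTF-16 and listing the 16-bit units, which is roughly what a UCS-2 build stored per string element. This is only a sketch; utf16_code_units is a hypothetical helper name, and U+1F600 is an arbitrary example character.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units that encode `text`."""
    data = text.encode("utf-16-be")
    # Each code unit is two bytes, big-endian, with no byte-order mark.
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

# A BMP character needs one unit; a non-BMP character needs a surrogate pair.
print([hex(u) for u in utf16_code_units(u"A")])
print([hex(u) for u in utf16_code_units(u"\U0001F600")])
```

On a UCS-2 build, len() of the second string would count both surrogates, which is the visible breakage described above.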