I have a server written in FastAPI (Python 3.8.13) that receives data through a form from an external service; the data may include Hebrew letters. Until recently, the data arrived through a reverse proxy server written in Python 2.7, and everything worked well. The proxy code looked something like this:
from requests import post
from bottle import route, request

@route('/', method='POST')
def new_req():
    post('https://....', dict(request.forms))
    return ''
The backend server code looks something like this:
@app.post('/....')
async def get_message(request: Request, message: str = Form(...)):
    print(message)
As long as the data arrived via the proxy, there were no problems with Hebrew letters. From the moment we asked the service to send the messages directly to the backend server, strings like 'שלום' showed up as gibberish like 'שלו×'. The service states that it sends the Hebrew data as Unicode, so I tried something like this:
@app.post('/....')
async def get_message(request: Request, message: bytes = Form(...)):
    print(message)
    message = message.decode('utf-8')
    print(message)
The result (for 'שלום'):

b'\xc3\x97\xc2\xa9\xc3\x97\xc2\x9c\xc3\x97\xc2\x95\xc3\x97\xc2\x9d'
×©××× (the printed string also contains invisible control characters)
I tried replacing 'utf-8' with different encodings, international and Hebrew, and each time I got a new kind of gibberish. Does anyone have an idea what else to try?
This is a “double-UTF-8-encoded” string. That is, you start with the sequence of characters 'שלום', then:

1. Encode them into UTF-8, producing
   b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
2. Misinterpret this byte string as being ISO-8859-1 encoded, producing the nonsense string
   ×©××× (which also contains invisible C1 control characters)
3. And encode that string into UTF-8 again, which becomes the observed byte sequence
   b'\xc3\x97\xc2\xa9\xc3\x97\xc2\x9c\xc3\x97\xc2\x95\xc3\x97\xc2\x9d'
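You can verify this chain in a few lines of standalone Python (this is a diagnostic sketch, separate from the server code):

```python
# Start with the original Hebrew text.
original = 'שלום'

# Step 1: encode it to UTF-8.
utf8_bytes = original.encode('utf-8')
print(utf8_bytes)  # b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'

# Step 2: misinterpret those bytes as ISO-8859-1 (Latin-1),
# producing a nonsense string of Latin characters and C1 controls.
mojibake = utf8_bytes.decode('iso-8859-1')

# Step 3: encode the nonsense string to UTF-8 a second time.
double_encoded = mojibake.encode('utf-8')
print(double_encoded)  # b'\xc3\x97\xc2\xa9\xc3\x97\xc2\x9c\xc3\x97\xc2\x95\xc3\x97\xc2\x9d'
```

The output of step 3 is byte-for-byte the sequence your endpoint receives, which confirms the diagnosis.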
So, you need to figure out where this redundant encode operation is occurring, and find a way to tell your program “this bytes object is already UTF-8, so don't re-encode it.”
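Until the upstream sender is fixed, one workaround is to reverse the accidental round-trip on the receiving side. This is a sketch that assumes the damage is exactly the Latin-1 round-trip described above; `undo_double_utf8` is a hypothetical helper name:

```python
def undo_double_utf8(damaged: str) -> str:
    """Reverse an accidental Latin-1 round-trip: re-encode the
    mojibake string as Latin-1 to recover the raw bytes, then
    decode those bytes as the UTF-8 they actually are."""
    return damaged.encode('iso-8859-1').decode('utf-8')

# The string the endpoint sees after the double-encoded bytes
# are decoded once as UTF-8:
damaged = b'\xc3\x97\xc2\xa9\xc3\x97\xc2\x9c\xc3\x97\xc2\x95\xc3\x97\xc2\x9d'.decode('utf-8')
print(undo_double_utf8(damaged))  # שלום
```

In the FastAPI endpoint you would apply this to the `message` form field. Note this only works if the sender is consistently double-encoding; once they fix their side, applying it to correct UTF-8 would corrupt the data again.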