I am currently developing a Django app that calls a Java REST API and retrieves multilingual results (the results originally come from Elasticsearch). I can retrieve the results and store them in an object just fine, but displaying them with JavaScript gives me junk. This is supposed to be Russian:
When converting it to a string or trying to convert to unicode, I get:
UnicodeEncodeError at /getObjectArticles
'ascii' codec can't encode characters in position 23-24: ordinal not in range(128)
I know the API is returning good data because calling with a Java app works fine. Any idea how to handle the incoming string so it will be recognizable characters?
EDIT: My ingest code..
try:
    g = requests.post(baseUrl, query_string)
except requests.exceptions.RequestException as e:
    print e
try:
    obj = g.json()
    articleTitle = obj['hit']['title']
    str(articleTitle)  # This results in a Unicode error
    articleTitle.decode("UTF-8")  # This results in a Unicode error
EDIT: My JavaScript/jQuery:
// Load article text
function getArticleText(articleId, index) {
    console.log($('#result_number').val());
    var es_url = gu.webapp_url + '/getArticle?articleId=' + encodeURIComponent(articleId) + "&index=" + encodeURIComponent(index);
    $.get(es_url).success(function(data) {
        console.log(data);
        var decodedText = $("<div/>").html(data.text).text();
        var decodedTitle = $("<div/>").html(data.articleTitle).text();
        // Close Article View Button
        $('#g2i2-article-info').html("<div id=\"closeArticleInfo\" class=\"closeWindow\">X</div>");
        // Article Info Table
        var articleTable = "<table class=\"table table-striped table-bordered table-condensed\">";
        articleTable = articleTable + "<tr><td>Title</td><td>" + decodedTitle + "</td></tr>";
        articleTable = articleTable + "<tr><td>Publication Date</td><td>" + data.pubDate + "</td></tr>";
        articleTable = articleTable + "<tr><td>Source Name</td><td>" + data.sourceName + "</td></tr>";
        articleTable = articleTable + "<tr><td>Location</td><td>" + data.locationName + "</td></tr>";
        articleTable = articleTable + "<tr><td>URL</td><td>" + data.url + "</td></tr>";
        articleTable = articleTable + "</table>";
        $('#g2i2-article-info').append(articleTable);
        // Article Text
        $('#g2i2-article-info').append(decodedText);
        $('#g2i2-article-info').css('display', 'block');
    }).error(function(jqXHR, textStatus, errorThrown) {
        console.log(textStatus + " " + errorThrown);
    });
}
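You already have Unicode data on your server; response.json() produces Unicode values for any JSON string. There is no need to try and decode it. In Python 2, calling str() or .decode() on a unicode object first triggers an implicit encode with the ASCII codec, and that is what raises the UnicodeEncodeError.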
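To see where that error comes from, here is a minimal sketch of the same failure. It is written for Python 3, where the ASCII encode has to be spelled out rather than happening implicitly, and the title value is a made-up stand-in for obj['hit']['title'], not data from the actual API:

```python
# Hypothetical stand-in for obj['hit']['title']; response.json() already
# returns decoded text, so there is nothing left to decode.
title = "\u0421\u043e"  # two Cyrillic letters

try:
    # Python 2's str(title) performs this ASCII encode implicitly.
    title.encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

The message has the same shape as the one in the question, which suggests the server-side error is only about the unnecessary conversion, not about bad data.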
It is the browser that is producing this Latin 1 Mojibake. The browser is sent UTF-8 (a multi-byte encoding) and the browser is interpreting individual bytes as Latin 1 characters instead. Your title, for example, starts with the Cyrillic text Со, which is encoded to UTF-8, then misinterpreted as Latin 1:
>>> u'Со'
u'\u0421\u043e'
>>> u'Со'.encode('utf8')
'\xd0\xa1\xd0\xbe'
>>> print u'Со'.encode('utf8').decode('latin1')
Ð¡Ð¾
So the D0 A1 bytes in UTF-8, which form one codepoint, are being printed as two Latin-1 characters instead.
The Ñ character is the D1 byte, which can be followed by any of about 33 second UTF-8 bytes that are non-printable in Latin-1 (80 through A0) to make a character in the range р through to Ѡ. Next is Ð¸, which is really и, etc.
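These byte-level claims can be checked directly. A small sketch, written for Python 3 (the REPL session above is Python 2); the helper name as_latin1_mojibake is made up for illustration:

```python
def as_latin1_mojibake(text):
    """Encode text as UTF-8, then misread every byte as one Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")

# One Cyrillic code point becomes two Latin-1 characters.
so = as_latin1_mojibake("\u0421\u043e")   # the two letters that start the title
i_letter = as_latin1_mojibake("\u0438")   # the Cyrillic letter discussed above

# The D1 lead byte followed by 80 or A0 gives the two ends of the range.
range_start = bytes([0xD1, 0x80]).decode("utf-8")  # U+0440
range_end = bytes([0xD1, 0xA0]).decode("utf-8")    # U+0460

print(repr(so), repr(i_letter), repr(range_start), repr(range_end))
```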
You need to figure out why the browser thinks your data is Latin 1.
Usually this is determined from the Content-Type header sent to the browser; if it is set to text/html; charset=ISO-8859-1 then the browser will behave as if all text is Latin 1. It could also be that the HTML page has a <meta> tag, one of <meta charset="ISO-8859-1"> or <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> or similar; several closely related encodings (Windows-1252, for example) all have similar Mojibake effects.
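One way to check which charset a response declares is to parse its Content-Type value. A minimal Python 3 sketch using only the standard library; declared_charset is a hypothetical helper name, and the header strings are examples rather than values taken from your app:

```python
from email.message import Message

def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None when not declared

print(declared_charset("text/html; charset=ISO-8859-1"))
print(declared_charset("application/json; charset=utf-8"))
print(declared_charset("application/json"))
```

With requests, the value to inspect would be response.headers['Content-Type']; in the browser, the network tab of the developer tools shows the same header for both the page and the /getArticle response.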
Another option is that you encoded it to UTF-8 explicitly, then managed to decode it somewhere to Latin-1 again before sending it to the browser.
A third option is that the JSON service itself sent you UTF-8 bytes that had already been misread as Latin-1 inside a JSON Unicode string, giving you a Mojibake source. In that case you can still repair it by encoding to Latin 1 then decoding from UTF-8:
fixed = broken.encode('latin1').decode('utf8')
but do so only after you have verified that your data on the server is already Mojibaked.
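A slightly more defensive version of that repair, as a sketch for Python 3. The function name repair_mojibake is made up, and it assumes the damage is exactly one UTF-8-read-as-Latin-1 round trip:

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 round trip; return text unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside Latin-1 (so it was not
        # Mojibaked this way) or the resulting bytes are not valid UTF-8.
        return text

print(repair_mojibake("\u00d0\u00a1\u00d0\u00be"))  # Mojibaked form of the title start
print(repair_mojibake("\u0421\u043e"))              # already correct, left alone
```

The fallback only catches text that cannot be a Mojibake of this kind. Genuine Latin-1 text whose bytes happen to form valid UTF-8 would still be altered, which is why the data should be verified first.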