This is a part of my Instagram account backup
[
{
"media": [
{
"title": "\u00d0\u0094\u00d0\u00be\u00d1\u0080\u00d0\u00be\u00d0\u00b3\u00d0\u00be\u00d0\u00b9 \u00d0\u00b4\u00d1\u0080\u00d1\u0083\u00d0\u00b3"
}
]
}
]
To parse this I use Codable
struct BlogPost: Codable {
let media: [Media]
}
struct Media: Codable {
let title: String
}
But this code prints ÐоÑогой дÑÑг
let bundle = Bundle.main
let path = bundle.path(forResource: "posts_1", ofType: "json")
let content = try? String(contentsOfFile: path!)
let data = content!.data(using: .utf8)!
let result = try? JSONDecoder().decode([BlogPost].self, from: data)
print(result![0].media[0].title)
And it should print Дорогой друг. How to decode this string on iOS? I am also using mothereff.in to decode backup data.
Let's start by summarizing some details. Instagram is encoding the string "Дорогой друг"
as "\u00d0\u0094\u00d0\u00be\u00d1\u0080\u00d0\u00be\u00d0\u00b3\u00d0\u00be\u00d0\u00b9 \u00d0\u00b4\u00d1\u0080\u00d1\u0083\u00d0\u00b3"
Let's look at what this means. The Д
is the Unicode character U+0414. It has a UTF-8 encoding of D0 94
. Note that the encoded title in the JSON begins with \u00d0\u0094
. Then the о
is the Unicode character U+043E with a UTF-8 encoding of D0 BE
. And sure enough, the encoded title in the JSON has \u00d0\u00be
as the next set of values. So it seems that Instagram is encoding the string as UTF-8 while using the \uxxxx
escape characters. At least for the Cyrillic characters. The space is encoded as a regular space character.
The problem is that JSONDecoder
expects that if a string contains escaped characters in the form \uxxxx
, it assumes the code is the Unicode value, not part of the UTF-8 encoding. When it parses the title, it first sees \u00d0
. That's the Unicode character Ð
. Then it sees \u0094
. That's the Unicode character "CANCEL CHARACTER", a non-printable character. This continues and you end up with "ÐоÑогой дÑÑг"
.
JSONDecoder
has no built-in functionality to tell it how to handle Instagram's non-standard encoding of strings. So this means the only solution is to write a custom decoder.
Here is a working solution. Update your Media
struct as follows:
struct Media: Codable {
let title: String
init(title: String) {
self.title = title
}
init(from decoder: Decoder) throws {
let container = try decoder.container(keyedBy: CodingKeys.self)
let str = try container.decode(String.self, forKey: .title)
let data = Data(str.reduce([], { partialResult, char in
char.unicodeScalars.reduce(into: partialResult) { partialResult, scalar in
partialResult.append(UInt8(scalar.value))
}
}))
let res = String(data: data, encoding: .utf8)
self.title = res ?? "" // some fallback as desired
}
}
This is fine if there's only the one value to handle. If you need to deal with this for more than one property, move the logic to a String
extension:
extension String {
var fromInstagramEncoding: String? {
let data = Data(self.reduce([], { partialResult, char in
char.unicodeScalars.reduce(into: partialResult) { partialResult, scalar in
partialResult.append(UInt8(scalar.value))
}
}))
return String(data: data, encoding: .utf8)
}
}
Then the updated Media
code becomes:
struct Media: Codable {
let title: String
init(title: String) {
self.title = title
}
init(from decoder: Decoder) throws {
let container = try decoder.container(keyedBy: CodingKeys.self)
let str = try container.decode(String.self, forKey: .title)
self.title = str.fromInstagramEncoding ?? "" // some fallback as desired
}
}
Here's a complete example that can be run in a Playground:
struct BlogPost: Codable {
let media: [Media]
}
struct Media: Codable {
let title: String
init(title: String) {
self.title = title
}
init(from decoder: Decoder) throws {
let container = try decoder.container(keyedBy: CodingKeys.self)
let str = try container.decode(String.self, forKey: .title)
self.title = str.fromInstagramEncoding ?? ""
}
}
extension String {
var fromInstagramEncoding: String? {
let data = Data(self.reduce([], { partialResult, char in
char.unicodeScalars.reduce(into: partialResult) { partialResult, scalar in
partialResult.append(UInt8(scalar.value))
}
}))
return String(data: data, encoding: .utf8)
}
}
let instagramJSON = """
[
{
"media": [
{
"title" : "\\u00d0\\u0094\\u00d0\\u00be\\u00d1\\u0080\\u00d0\\u00be\\u00d0\\u00b3\\u00d0\\u00be\\u00d0\\u00b9 \\u00d0\\u00b4\\u00d1\\u0080\\u00d1\\u0083\\u00d0\\u00b3"
}
]
}
]
"""
let badData = instagramJSON.data(using: .utf8)!
let result = try JSONDecoder().decode([BlogPost].self, from: badData)
print(result[0].media[0].title)
Output:
Дорогой друг
Note that this solution works with the provided example. It's possible that Instagram encodes some characters in such a way that this solution could fail in some cases. Without more data I can't know for sure. Post a comment with relevant details if you come across an example that this code doesn't handle correctly.