json swift decodable google-suggest

How to decode a non-UTF8 encoded JSON array using Swift?


I'm encountering some weird edge cases when parsing JSON data from the Google search autocomplete API. This is my model for decoding the JSON data:

struct suggestOutputModel: Decodable {
    let query: String
    let suggestions: [String]?
    let thirdValue: [String]?
    let fourthValue: GoogleSuggestSubtypes?
    
    struct GoogleSuggestSubtypes: Decodable {
        let googlesuggestsubtypes: [[Int]]
        enum CodingKeys: String, CodingKey {
            case googlesuggestsubtypes = "google:suggestsubtypes"
        }
    }
    
    init(from decoder: Decoder) throws {
        var container = try decoder.unkeyedContainer()
        query = try container.decode(String.self)
        suggestions = try container.decode([String]?.self)
        thirdValue = try container.decodeIfPresent([String].self)
        fourthValue = try container.decodeIfPresent(GoogleSuggestSubtypes.self)
    }
}

Most of the time this works. But when I query the API with ( and attempt to parse the response, I get the following error:

Swift.DecodingError.dataCorrupted(Swift.DecodingError.Context(codingPath: [], debugDescription: "The given data was not valid JSON.", underlyingError: Optional(Error Domain=NSCocoaErrorDomain Code=3840 "Unable to convert data to string around line 1, column 100." UserInfo={NSDebugDescription=Unable to convert data to string around line 1, column 100., NSJSONSerializationErrorIndex=100})))

This is what the JSON response I receive from the API looks like:

["(", ["(x2+y2-1)x2y3\u003d0", "(", "(g)i-dle", "(a-b)^2", "(a+b)^3", "(g)i-dle nxde lyrics", "(a+b)(a-b)", "( ͡° ͜ʖ ͡°)", "(working title) riverside menu", "(working title) burger bar menu"],
    [], {
        "google:suggestsubtypes": [
            [512, 433, 131],
            [512, 433],
            [512, 433, 131],
            [433],
            [512],
            [512],
            [512],
            [512],
            [512],
            [512]
        ]
    }
]

There are some uncommon characters, so maybe that's the issue, though none of them seem UTF-8 incompatible? I also tried running file -I on the .txt file I received from the API when calling it in the browser, and it reports the encoding as UTF-8. In any case, the following test suggests the problem is indeed trying to decode using UTF-8:

let utf8Test: String = String(data: data, encoding: .utf8)! // Fatal error: Unexpectedly found nil while unwrapping an Optional value
let latin1Test: String = String(data: data, encoding: .isoLatin1)! // Succeeds, although some characters are represented in the \u format

So I then try manually decoding the JSON array containing uncommon characters (the suggestions) to String using Latin-1, re-encoding to UTF-8 data, and then decoding again to [String], as such:

init(from decoder: Decoder) throws {
    var container = try decoder.unkeyedContainer()
    query = try container.decode(String.self)
    do {
        suggestions = try container.decode([String]?.self)
    } catch { // If undecodable, try converting data from latin1 to utf8 and redecoding
        suggestions = {
            let latin1Data: Data = try! container.decode(Data.self)
            let string: String = String(data: latin1Data, encoding: .isoLatin1)!
            let utf8Data: Data = string.data(using: .utf8)!
            return try! JSONDecoder().decode([String]?.self, from: utf8Data)
        }()
    }
    thirdValue = try container.decodeIfPresent([String].self)
    fourthValue = try container.decodeIfPresent(GoogleSuggestSubtypes.self)
}

However, this throws the exact same error I received before, and I'm now out of ideas for how to solve this. I would also appreciate an explanation of why I'm encountering this problem: if the file I receive from the API is ostensibly encoded in UTF-8, why would decoding with UTF-8 fail while Latin-1 succeeds?


Solution

  • My question above is convoluted and probably way too long. I'm leaving it as is because it lists a series of failed attempts, which may be useful to see. Below, I clarify the problem and explain the solution I eventually found:

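    Problem: Google has a publicly accessible query autocomplete API that, afaik, isn't meant for public use. This API outputs JSON encoded in ISO Latin-1, even though JSON is supposed to be Unicode (UTF-8 in practice). Swift's JSONDecoder conforms to the standard and assumes the data it receives is Unicode-encoded. As a result, it sometimes (but not always!) fails to decode the API responses: ASCII characters have the same bytes in Latin-1 and UTF-8, so a response only breaks when it contains a non-ASCII character such as the ° in ( ͡° ͜ʖ ͡°) (presumably the culprit in my example), which Latin-1 stores as a single byte that is not valid UTF-8. The failure happens while the decoder parses the raw data, before init(from:) is ever called, which is why catching the error inside my custom initializer didn't help.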
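
A minimal sketch of why only some responses fail, assuming hand-built payloads rather than real API output (the byte values and the `decodes` helper are illustrative, not taken from the API):

```swift
import Foundation

// Hypothetical payloads, built by hand for illustration.
// ASCII-only JSON has identical bytes in ISO Latin-1 and UTF-8.
let asciiPayload = Data(#"["(a+b)^2"]"#.utf8)

// The JSON text ["(°)"] as ISO Latin-1 bytes: "°" is the single byte 0xB0,
// which is not a valid UTF-8 sequence on its own.
let latin1Payload = Data([0x5B, 0x22, 0x28, 0xB0, 0x29, 0x22, 0x5D])

/// Illustrative helper: reports whether JSONDecoder accepts the bytes as a [String].
func decodes(_ data: Data) -> Bool {
    (try? JSONDecoder().decode([String].self, from: data)) != nil
}

let asciiOK = decodes(asciiPayload)      // ASCII-only bytes are valid UTF-8
let latin1OK = decodes(latin1Payload)    // the lone 0xB0 byte is rejected
let asUTF8 = String(data: latin1Payload, encoding: .utf8)          // nil
let asLatin1 = String(data: latin1Payload, encoding: .isoLatin1)   // non-nil
```

This mirrors the `utf8Test` / `latin1Test` check from the question: every byte sequence is valid ISO Latin-1, so that conversion cannot fail, while UTF-8 rejects stray bytes above 0x7F.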

    Solution: Re-encode the data as UTF-8 and attempt to parse again, as shown below. For extra safety, obtain the actual encoding from the API response's headers (credits to this answer). In all my tests, it's always ISO Latin-1, but you can never be too sure when the API lacks public documentation.

    let url: URL = ... // construct a URL for querying the API
    let (data, response): (Data, URLResponse) = try await URLSession.shared.data(from: url)
    let output: YourJSONModel
    do {
        output = try JSONDecoder().decode(YourJSONModel.self, from: data)
    } catch { // If directly decoding fails
        
        // Get actual encoding from the response headers (nil if no charset was sent)
        guard let encodingHeader: String = response.textEncodingName else { throw error }
        let encoding: String.Encoding = String.Encoding(rawValue: CFStringConvertEncodingToNSStringEncoding(CFStringConvertIANACharSetNameToEncoding(encodingHeader as CFString)))
        
        // Decode & re-encode in UTF-8; rethrow the original error if that isn't possible
        guard let decodedString: String = String(data: data, encoding: encoding),
              let utf8Data: Data = decodedString.data(using: .utf8) else { throw error }
        
        // Try parsing data again
        output = try JSONDecoder().decode(YourJSONModel.self, from: utf8Data)
    }
    }
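
The same fallback can be wrapped in a reusable function. This is only a sketch: `decodeWithFallback` and its `encoding` parameter are hypothetical names, and the default of ISO Latin-1 is an assumption based on what the API returned in my tests. Resolving the encoding from the response headers is left to the caller.

```swift
import Foundation

/// Hypothetical helper: tries to decode `data` as-is, and on failure
/// re-encodes it from `encoding` to UTF-8 and tries once more.
/// The ISO Latin-1 default is an assumption, not documented API behavior.
func decodeWithFallback<T: Decodable>(
    _ type: T.Type,
    from data: Data,
    encoding: String.Encoding = .isoLatin1
) throws -> T {
    do {
        return try JSONDecoder().decode(type, from: data)
    } catch {
        // Interpret the raw bytes in the fallback encoding, then re-encode as UTF-8.
        // If either step fails, surface the original decoding error.
        guard let text = String(data: data, encoding: encoding),
              let utf8Data = text.data(using: .utf8) else {
            throw error
        }
        return try JSONDecoder().decode(type, from: utf8Data)
    }
}
```

With the names from the snippet above, usage would be something like `let output = try decodeWithFallback(YourJSONModel.self, from: data)`, passing the encoding derived from `response.textEncodingName` when one is available.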