Python unable to load a JSON file with UTF-8 encoding

With the following python code:

import json
import urllib2

filePath = urllib2.urlopen('xx.json')
fileJSON = json.loads(filePath.read().decode('utf-8'))

Where the xx.json looks like:

{
    "tags": [{
        "id": "123",
        "name": "Airport",
        "name_en": "Airport",
        "name_cn": "机场",
        "display": false
    }]
}

I see the following exception:

    fileJSON = json.loads(filePath.read().decode('utf-8'))
  File "/usr/lib64/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

The code worked before I added the Chinese characters to the JSON file, which is also when I added .decode('utf-8') after read().

I am not sure what needs to be done.

Solution

$ wget https://s3.amazonaws.com/wherego-sims/tags.json 
$ file tags.json 
tags.json: UTF-8 Unicode (with BOM) text, with CRLF line terminators

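This file begins with a UTF-8 byte order mark (bytes EF BB BF), which is not allowed at the start of a JSON text (see "JSON Specification and usage of BOM/charset-encoding"). Decoding with 'utf-8' leaves the BOM in the string as a leading U+FEFF character, which the JSON parser rejects. Decode with 'utf-8-sig' instead to strip it and get a valid JSON unicode string: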

json.loads(filePath.read().decode('utf-8-sig'))
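
To see the difference between the two codecs without downloading anything, here is a minimal sketch written for Python 3. The payload in `raw` is made up for illustration (it is not the real tags.json); only the leading BOM matters.

```python
import codecs
import json

# Hypothetical payload: a UTF-8 BOM followed by a small JSON document.
raw = codecs.BOM_UTF8 + u'{"name_cn": "\u673a\u573a"}'.encode('utf-8')

# Plain 'utf-8' decodes the BOM bytes into a leading U+FEFF character.
with_bom = raw.decode('utf-8')
print(repr(with_bom[0]))

# 'utf-8-sig' drops the BOM if it is present and changes nothing otherwise.
text = raw.decode('utf-8-sig')
data = json.loads(text)
print(data['name_cn'])
```

Because 'utf-8-sig' is harmless on input without a BOM, it is a reasonable default when you do not control how the file was saved.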

For what it's worth, Python 3 (which you should be using) will give a specific error in this case and guide you in handling this malformed file:

json.decoder.JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)

That is, specify an encoding that discards the BOM if it exists (again, a BOM is not conventional in UTF-8, and it is worse than useless in JSON, which is expected to be UTF-8 anyway):

>>> import json
>>> json.load(open('tags.json', encoding='utf-8-sig'))
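
Outside the interactive prompt, the same idea could be wrapped up roughly as below. This is only a sketch for Python 3: the helper names `parse_json_bytes`, `load_json_url` and `load_json_file` are invented here, and `urllib.request.urlopen` stands in for the `urllib2.urlopen` call from the question.

```python
import json
from urllib.request import urlopen


def parse_json_bytes(raw):
    """Decode UTF-8 bytes, dropping a leading BOM if present, then parse."""
    return json.loads(raw.decode('utf-8-sig'))


def load_json_url(url):
    """Fetch a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return parse_json_bytes(response.read())


def load_json_file(path):
    """Parse a local JSON file, closing the file handle when done."""
    with open(path, encoding='utf-8-sig') as handle:
        return json.load(handle)
```

With these in place, the question's code would become something like load_json_url('https://s3.amazonaws.com/wherego-sims/tags.json'), and the file on disk could be read with load_json_file('tags.json').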