I'm trying to load some geographic data with Python's simplejson.
<!-- language: lang-py -->
string = file("prCounties.txt","r").read().decode('utf-8')
d = simplejson.loads(string)
The text file contains a name with a tilde: the word should be Añasco, but instead it appears as u"A\xf1asco", which simplejson is not parsing. The source is a GeoJSON file from GitHub:
{"type": "FeatureCollection", "properties": {"kind": "state", "state": "PR"}, "features": [[{"geometry": {"type": "MultiPolygon", "coordinates": [[[[-67.122, 18.3239], [-67.0508, 18.3075], [-67.0398, 18.291], [-67.0837, 18.2527], [-67.122, 18.2417], [-67.1603, 18.2746], [-67.1877, 18.2691], [-67.2261, 18.2965], [-67.1822, 18.3129], [-67.1275, 18.3184]]]]}, "type": "Feature", "properties": {"kind": "county", "name": u"A\xf1asco", "state": "PR"}}]]}
Python gives me the error simplejson.decoder.JSONDecodeError: Expecting object
Below is the script I used to load the data from GitHub and generate prCounties.txt. The variable counties is a list of strings giving the locations of the relevant GeoJSON data.
It's clear this is not the proper way to save this data:
<!-- language: lang-py -->
countyGeo = []
for x in counties:
    d = simplejson.loads(urllib.urlopen("https://raw.github.com/johan/world.geo.json/master/countries/USA/PR/%s" % (x)).read())
    countyGeo += [d["features"][0]]
d["features"][0] = countyGeo
file("prCounties.txt", "w").write(str(d))
EDIT: In the last line, I replaced str with simplejson.dumps. I guess it encodes properly now.
file("prCounties.txt", "w").write(simplejson.dumps(d))
There are two problems here. First:
string = file("prCounties.txt","r").read().decode('utf-8')
Why are you decoding it? JSON text is UTF-8 by default; that's part of the definition of JSON. simplejson.loads accepts a UTF-8 byte string as-is. The fact that simplejson can also handle Unicode strings makes it a little easier to use, but decoding first gains you nothing here, so… why not just leave the data as UTF-8 in the first place?
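As a rough illustration of why the decode step is redundant, the sketch below parses the same UTF-8 data twice, once as raw bytes and once after decoding. It assumes Python 3.6 or later and uses the standard library json module as a stand-in for simplejson; the sample document is made up for the example.

```python
import json

# A small JSON document encoded as UTF-8 bytes, as it would come off disk.
raw_bytes = u'{"kind": "county", "name": "A\u00f1asco"}'.encode("utf-8")

# Parse the bytes directly (json.loads accepts UTF-8 bytes on Python 3.6+).
from_bytes = json.loads(raw_bytes)

# Parse after an explicit decode, as the question's code does.
from_text = json.loads(raw_bytes.decode("utf-8"))

# Both routes produce the same dictionary.
same_result = from_bytes == from_text
```

Either way the parser ends up with the same Unicode name, so the extra decode only adds a step.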
More importantly, where did your data come from? If prCounties.txt has that u"Añasco" in it, it's not JSON. You can't write data in one format and read it back as a completely different format just because the two look similar.
If, for example, you did open('prCounties.txt', 'w').write(repr(my_dict)), you have to read it back with a Python repr parser (possibly ast.literal_eval, or maybe you have to write something yourself).
Or, alternatively, if you want to parse the data as JSON, write it as JSON in the first place.
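The two options above can be sketched as follows. This is a minimal example, not the asker's actual data: the repr-style text is hand-written to mimic what Python 2's str() produces for a dict, and the standard library json module stands in for simplejson.

```python
import ast
import json

# Text in the style Python 2's str()/repr() produces for a dict:
# single quotes and a u'' prefix, neither of which is valid JSON.
repr_text = r"{'kind': 'county', 'name': u'A\xf1asco', 'state': 'PR'}"

# A JSON parser rejects it.
try:
    json.loads(repr_text)
    repr_is_json = True
except ValueError:  # JSON decode errors are ValueError subclasses
    repr_is_json = False

# Option 1: read repr-style text back with a Python literal parser.
record = ast.literal_eval(repr_text)

# Option 2: write real JSON, which a JSON parser can then read back.
json_text = json.dumps(record)
round_tripped = json.loads(json_text)
```

Here repr_is_json ends up False, while both ast.literal_eval and the dumps/loads round trip recover the same dictionary.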
According to your comment, the data was read from https://raw.github.com/johan/world.geo.json/master/countries/USA/PR/Añasco.geo.json
The raw contents of that URL are:
{"type":"FeatureCollection","properties":{"kind":"state","state":"PR"},"features":[
{"type":"Feature","properties":{"kind":"county","name":"Añasco","state":"PR"},"geometry":{"type":"MultiPolygon","coordinates":[[[[-67.1220,18.3239],[-67.0508,18.3075],[-67.0398,18.2910],[-67.0837,18.2527],[-67.1220,18.2417],[-67.1603,18.2746],[-67.1877,18.2691],[-67.2261,18.2965],[-67.1822,18.3129],[-67.1275,18.3184]]]]}}
]}
You'll notice that there is no "name": u"Añasco" (or "name": u"A\xf1asco", or anything similar) there. You can read this just by calling read (no need to decode it from UTF-8 or anything) and just pass it to simplejson.loads, and it works just fine:
$ curl -O https://raw.github.com/johan/world.geo.json/master/countries/USA/PR/Añasco.geo.json
$ cp Añasco.geo.json prCounties.txt
$ python
>>> import simplejson
>>> string = file("prCounties.txt","r").read()
>>> d = simplejson.loads(string)
>>> print d
{u'type': u'FeatureCollection', u'properties': {u'kind': u'state', u'state': u'PR'}, u'features': [{u'geometry': {u'type': u'MultiPolygon', u'coordinates': [[[[-67.122, 18.3239], [-67.0508, 18.3075], [-67.0398, 18.291], [-67.0837, 18.2527], [-67.122, 18.2417], [-67.1603, 18.2746], [-67.1877, 18.2691], [-67.2261, 18.2965], [-67.1822, 18.3129], [-67.1275, 18.3184]]]]}, u'type': u'Feature', u'properties': {u'kind': u'county', u'name': u'A\xf1asco', u'state': u'PR'}}]}
See, no errors at all.
Somewhere, you've done something to this data to turn it into something else which is not JSON. My guess is that, on top of doing a bunch of unnecessary extra decode and encode calls, you've also done a simplejson.loads, then tried to re-simplejson.loads the repr of the dict you got back. Or maybe you've JSON-encoded a dict full of already-encoded JSON strings. Whatever you've done, that code, not the code you're showing us, is where the error is.
And the easiest fix is probably to generate prCounties.txt properly in the first place. It's just 70-odd downloads of a few lines apiece, and it should take maybe 2 lines of bash or 4 lines of Python to do it…
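One possible shape for that generation step is sketched below. It leaves out the downloading and only shows the merge and the write, assuming each downloaded file is a FeatureCollection with a "features" list. The helper names merge_feature_collections and save_json are hypothetical, the sample documents are trimmed stand-ins rather than the real files, and the standard library json module is used in place of simplejson.

```python
import json


def merge_feature_collections(raw_texts):
    """Combine several GeoJSON FeatureCollection texts into one dict."""
    merged = None
    for text in raw_texts:
        doc = json.loads(text)
        if merged is None:
            merged = doc  # keep the first collection's top-level keys
        else:
            # Extend, so "features" stays one flat list of features.
            merged["features"].extend(doc["features"])
    return merged


def save_json(obj, path):
    """Write obj to path as JSON text (not as a Python repr)."""
    with open(path, "w") as f:
        json.dump(obj, f)


# Trimmed stand-in documents (hypothetical; geometry omitted for brevity).
sample_texts = [
    '{"type": "FeatureCollection", "features": '
    '[{"type": "Feature", "properties": {"name": "A\\u00f1asco"}}]}',
    '{"type": "FeatureCollection", "features": '
    '[{"type": "Feature", "properties": {"name": "Adjuntas"}}]}',
]

merged = merge_feature_collections(sample_texts)
```

Calling save_json(merged, "prCounties.txt") would then produce a file that a JSON parser can load back without any extra decoding.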