I have a very simple JSON document that I can't parse with the simplejson module. Reproduction:
import simplejson as json
json.loads(r'{"translatedatt1":"Vari\351es"}')
Result:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.5/simplejson/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/pymodules/python2.5/simplejson/decoder.py", line 335, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/pymodules/python2.5/simplejson/decoder.py", line 351, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 23 (char 23)
Does anyone have an idea what's wrong and how to parse the JSON above correctly?
The string that is encoded there is: Variées
P.S. I use Python 2.5.
Thanks a lot!
The error is correct: Vari\351es contains an invalid escape. The JSON standard does not allow a \ followed by just digits.
Whatever produced that output should be fixed. If that is impossible, you'll need to use a regular expression to either remove those escapes or replace them with valid escapes.
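For comparison, here is a minimal sketch of what a valid escape for the same character looks like. It assumes the intended character is é (U+00E9), as stated in the question, and falls back to the standard-library json module when simplejson is not installed (that fallback is an assumption of this sketch, not part of the question).

```python
try:
    import simplejson as json  # the module used in the question
except ImportError:
    import json  # standard-library equivalent (Python 2.6 and later)

# The same value written with a valid JSON \uXXXX escape; U+00E9 is "é".
valid = r'{"translatedatt1":"Vari\u00e9es"}'
decoded = json.loads(valid)

# Serialising the decoded value again produces the valid escape form,
# because non-ASCII characters are escaped by default.
reencoded = json.dumps(decoded)
```

Besides \uXXXX, JSON only allows a backslash before ", \, /, b, f, n, r and t, which is why the parser rejects \351.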
If we interpret 351 as an octal number, it points to the Unicode code point U+00E9, the é character (LATIN SMALL LETTER E WITH ACUTE). You can 'repair' your JSON input with:
import re

# A backslash followed by octal digits; up to 6 digits covers code points up to U+FFFF
invalid_escape = re.compile(r'\\[0-7]{1,6}')

def replace_with_codepoint(match):
    # Drop the leading backslash and read the remaining digits as an octal code point
    return unichr(int(match.group(0)[1:], 8))

def repair(brokenjson):
    return invalid_escape.sub(replace_with_codepoint, brokenjson)
Using repair(), your example can be loaded:
>>> json.loads(repair(r'{"translatedatt1":"Vari\351es"}'))
{u'translatedatt1': u'Vari\xe9es'}
You may need to adjust the interpretation of the codepoints; I chose octal (because Variées is an actual word), but you need to test this more with other codepoints.
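If the input may also contain valid escapes, a somewhat more defensive variant of repair() could look like the sketch below. It still assumes the digits are octal code points. The names ESCAPE, to_unicode_escape and repair_strict are hypothetical, introduced only for this illustration. Instead of inserting the raw character, it rewrites each invalid escape as a \uXXXX escape, so a repaired quote or backslash cannot break the JSON, and it leaves valid escapes such as \\ or \n untouched.

```python
import re

try:
    import simplejson as json  # the module used in the question
except ImportError:
    import json  # standard-library equivalent (Python 2.6 and later)

# Match either a valid JSON escape introducer (group 1, left untouched)
# or a backslash followed by 1-6 octal digits (group 2, repaired).
# Matching valid escapes first means an escaped backslash followed by
# digits, as in \\351, is not mistaken for an octal escape.
ESCAPE = re.compile(r'\\(?:(["\\/bfnrtu])|([0-7]{1,6}))')

def to_unicode_escape(match):
    if match.group(1) is not None:
        return match.group(0)  # already valid, keep as-is
    codepoint = int(match.group(2), 8)  # assumption: digits are octal
    if codepoint > 0xFFFF:
        return match.group(0)  # does not fit \uXXXX; let the parser reject it
    return '\\u%04x' % codepoint

def repair_strict(brokenjson):
    return ESCAPE.sub(to_unicode_escape, brokenjson)

repaired = repair_strict(r'{"translatedatt1":"Vari\351es"}')
data = json.loads(repaired)
```

Like the original pattern, this one is greedy: if an escape is directly followed by another digit between 0 and 7, that digit is read as part of the escape, so it is worth checking the result against real data.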