Tags: python, unicode, encoding, urllib2, simplejson

Problems parsing a JSON which is read from a URL


I'm having a problem that I believe has a simple solution.

I'm writing a Python script which reads a JSON string from a URL and parses it. To do this I'm using urllib2 and simplejson.

The problem is related to encoding. The URL I'm reading from does not explicitly declare its encoding (as far as I can tell), and the response contains some Icelandic characters. I cannot share the real URL here, but I've set up a sample JSON data file on my own server that fails in the same way: http://haukurhaf.net/json.txt

This is my code:

#!/usr/bin/env python
# coding: utf-8
import urllib2
import simplejson as json

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3'

def fetchPage(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent', user_agent)
    response = urllib2.urlopen(req)
    html = response.read()
    response.close()
    return html

html = fetchPage("http://haukurhaf.net/json.txt")
jsonData = json.JSONDecoder().decode(html)

The JSON parser crashes with this error message:

    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 35: invalid continuation byte

Since I do not have any control over the server which holds the JSON data, I cannot control which encoding headers it sends out. I'm hoping I can solve this on my end somehow.

Any ideas?


Solution

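  • The file is encoded in Latin-1 (ISO-8859-1), not UTF-8. Byte 0xe1 is "á" in Latin-1, but at that position it does not start a valid UTF-8 sequence, which is why the default decoder fails. You have to specify the encoding explicitly: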

    jsonData = json.JSONDecoder(encoding='latin1').decode(html)
    

    BTW: html is a bad name for a JSON document...
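
An alternative that does not rely on the decoder's encoding argument is to decode the raw bytes to text yourself and then parse the text. The following is only a minimal sketch under that assumption: `parse_json_bytes` and the `sample` payload are hypothetical names invented for illustration, and it uses the standard-library `json` module rather than simplejson. In the real script, the bytes would come from `response.read()`.

```python
# -*- coding: utf-8 -*-
import json


def parse_json_bytes(raw, encoding='latin-1'):
    """Decode raw bytes with an explicit encoding, then parse the text as JSON.

    `encoding` defaults to Latin-1 here only because that is what the
    sample file in the question uses; it is an assumption, not a general rule.
    """
    text = raw.decode(encoding)
    return json.loads(text)


# Hypothetical payload standing in for the response body: a JSON document
# containing Icelandic characters, stored as Latin-1 bytes.
sample = u'{"name": "Sigur\u00f0ur J\u00f3nsson"}'.encode('latin-1')

data = parse_json_bytes(sample)
```

Decoding first keeps the encoding decision in one visible place, so if the server later switches to UTF-8 only the `encoding` argument needs to change.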