I'm a bit surprised that it's so complicated to get the charset of a webpage with Python. Am I missing a way? HTTPMessage has loads of methods, but not this one:
>>> google = urllib2.urlopen('http://www.google.com/')
>>> google.headers.gettype()
'text/html'
>>> google.headers.getencoding()
'7bit'
>>> google.headers.getcharset()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: HTTPMessage instance has no attribute 'getcharset'
So you have to get the header, and split it. Twice.
>>> google = urllib2.urlopen('http://www.google.com/')
>>> charset = 'ISO-8859-1'
>>> contenttype = google.headers.getheader('Content-Type', '')
>>> if ';' in contenttype:
...     charset = contenttype.split(';')[1].split('=')[1]
...
>>> charset
'ISO-8859-1'
That's a surprising number of steps for something so basic. Am I missing something?
Have you checked this?
How to download any(!) webpage with correct charset in python?
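Roughly, the idea there is that the HTTP header alone isn't reliable: check the header first, then any <meta> charset declaration in the HTML itself, and only then fall back to guessing. A rough sketch of that cascade (not the exact code from that answer; it assumes the chardet package is installed):

import re
import urllib2
import chardet

def fetch_decoded(url):
    response = urllib2.urlopen(url)
    raw = response.read()

    # 1. charset declared in the HTTP Content-Type header
    contenttype = response.headers.getheader('Content-Type', '')
    match = re.search(r'charset=([\w-]+)', contenttype)
    if match:
        return raw.decode(match.group(1), 'replace')

    # 2. charset declared in a <meta> tag in the document itself
    match = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', raw, re.I)
    if match:
        return raw.decode(match.group(1), 'replace')

    # 3. last resort: let chardet guess from the raw bytes
    guess = chardet.detect(raw)
    return raw.decode(guess['encoding'] or 'ISO-8859-1', 'replace')

print fetch_decoded('http://www.google.com/')[:100]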