I read a web page that contains Hebrew characters, using:
response = ('').join(opener.open(url).readlines())
The result I get is mixed: some of the characters come back as Unicode, which I can handle.
Some of the response seems garbled, in a format I can't recognize. An example of the received text is: &#1513;&#1500;&#1498;
More precisely, it looks like this (only a snippet):
<h3 class="_52r al aps">About גדי</h3><div>&#1513;&#1500;&#1498; ....</div>
The text inside the div seems scrambled. Can I convert it to Unicode?
You are looking at HTML entities; use the HTMLParser library to decode these:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> print h.unescape('&#1513;&#1500;&#1498;')
שלך
>>> h.unescape('&#1513;&#1500;&#1498;')
u'\u05e9\u05dc\u05da'
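The session above is Python 2. On Python 3 the HTMLParser module became html.parser and its unescape method was later removed, so the standard library function html.unescape is the usual replacement. A minimal sketch, assuming Python 3.4 or later and assuming the page used decimal character references (the variable names are only illustrative):

```python
import html

# Assumed input: the three Hebrew letters written as decimal
# numeric character references, as a browser would receive them.
encoded = '&#1513;&#1500;&#1498;'

# html.unescape converts numeric and named references to characters.
decoded = html.unescape(encoded)

print(decoded)
print(ascii(decoded))  # shows the code points as \u escapes
```

Hexadecimal references such as &#x5e9; and named ones such as &amp; should be handled by the same call.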
To read a full urllib2 response, just use .read():
response = opener.open(url).read()
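Putting the two steps together might look like the sketch below. It is written for Python 3, where .read() returns bytes that have to be decoded before unescaping. The helper name decode_page and the UTF-8 default are assumptions for illustration; the real charset should come from the response headers or the page itself. The sample bytes stand in for what .read() would return, so no network access is needed.

```python
import html

def decode_page(raw_bytes, charset='utf-8'):
    """Decode raw response bytes, then resolve HTML character references.

    Hypothetical helper: the charset default is an assumption.
    """
    text = raw_bytes.decode(charset)
    return html.unescape(text)

# Stand-in for opener.open(url).read(): raw Hebrew in the heading,
# numeric character references inside the div.
sample = '<h3>About \u05d2\u05d3\u05d9</h3><div>&#1513;&#1500;&#1498;</div>'.encode('utf-8')

page = decode_page(sample)
print(page)
```

Unescaping a whole document also turns references such as &lt; into literal markup characters, so if the HTML is going to be parsed afterwards it is probably safer to let the parser do the decoding, or to unescape only the extracted text.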