I have a problem with downloading webpages and processing them. What I want to do is:
My problem is the character encoding, because I get
<title>csonthãᄅjas termãᄅsek - wikipãᄅdia</title>
instead of
<title>csonthéjas termések - wikipédia</title>
The problem occurs with almost every accented or otherwise 'strange' character, such as áűóüő. When I simply write the text out as a string literal, it works:
print 'csonthéjas termések - wikipédia'
Chardet says that the page has ISO-8859-2 character encoding, but nothing changes when I change my script encoding. When I try to encode or decode the webpage with any charset, I get an error ('invalid continuation byte' or 'ordinal not in range(128)').
I tried many encodings, different browser agents, and detecting the encoding with chardet and then using that information, but nothing solved my problem. I know this is a simple question, but I could not find the correct answer to it. I use Windows 8.1 and Python 2.7.6.
My code is the following (I tried to cut it down to be as simple as I could):
#!/usr/bin/python
# -*- coding: ISO-8859-2 -*-
def url_get(url_input): #Get the webpage
    "Get the webpage"
    import mechanize
    url = url_input
    br = mechanize.Browser()
    br.set_handle_equiv(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    #User-agent','Mozilla/1.22 (compatible; MSIE 10.0; Windows 3.1)
    br.addheaders = [('user-agent', ' Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.3) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3'),
                     ('accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')]
    result = br.open(url).read().lower()
    print result
    import chardet
    rawdata = result
    detection = chardet.detect(rawdata)
    charenc = detection['encoding']
    print charenc
    return result
text = url_get('http://hu.wikipedia.org/wiki/Csonth%C3%A9jas_term%C3%A9sek')
print 'csonthéjas termések - wikipédia'
The page appears to be in UTF-8. Take your text and print text.decode('utf-8'). This works for me when I read the page content using the requests module.
You need to remove the lower() call, since converting to lowercase may corrupt the UTF-8 encoded text. If you want to convert to lowercase, call lower() after you decode.
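To illustrate the ordering issue, here is a minimal sketch that needs no network access. The sample bytes stand in for what br.open(url).read() would return, and both helper names are hypothetical. The second helper only simulates lowercasing raw bytes under a single-byte, Latin-1-like locale; whether Python 2's str.lower() really behaves this way depends on the locale in use, so treat that part as an assumption.

```python
# -*- coding: utf-8 -*-
# Sketch only: `raw` stands in for the bytes returned by br.open(url).read().

def decode_then_lower(raw_bytes, encoding='utf-8'):
    """Decode the bytes first, then lowercase the resulting Unicode text."""
    return raw_bytes.decode(encoding).lower()

def lower_as_single_byte_text(raw_bytes):
    """Hypothetical helper: imitate lowercasing the undecoded bytes one byte
    at a time, as if each byte were a Latin-1 character (an assumption)."""
    return raw_bytes.decode('latin-1').lower().encode('latin-1')

# UTF-8 bytes for the page title, written with escapes so the script's own
# source encoding does not matter.
raw = u'<title>Csonth\u00e9jas term\u00e9sek - Wikip\u00e9dia</title>'.encode('utf-8')

# Correct order: decode, then lowercase.
good = decode_then_lower(raw)

# Wrong order: lowercase the bytes, then try to decode them.
damaged = lower_as_single_byte_text(raw)
try:
    damaged.decode('utf-8')
    damaged_is_valid_utf8 = True
except UnicodeDecodeError:
    # The lead byte of each accented character was changed, so the byte
    # sequence is no longer well-formed UTF-8.
    damaged_is_valid_utf8 = False
```

In this sketch, good holds the lowercased Unicode title, while the damaged bytes can no longer be decoded as UTF-8, which would be consistent with the 'invalid continuation byte' error described in the question.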
When you use the # -*- coding line, you set the encoding of your script file. This has no effect on data that your script reads. To deal with text data in different encodings, you need to decode the data after you read it in.
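As a small sketch of that last point: the codec passed to decode() decides how downloaded bytes are interpreted, regardless of any coding line at the top of the script. The sample bytes below are an assumption standing in for real page data, and the variable names are illustrative only.

```python
# Sketch only: `raw` stands in for bytes read from a UTF-8 encoded page.
raw = u'term\u00e9sek'.encode('utf-8')

# Decoding with the codec the data was actually written in restores the text.
as_utf8 = raw.decode('utf-8')

# Decoding the same bytes as ISO-8859-2 does not raise an error, but it maps
# each byte of the two-byte UTF-8 sequence to a separate, wrong character.
as_latin2 = raw.decode('iso-8859-2')

print(repr(as_utf8))
print(repr(as_latin2))
```

A single-byte codec such as ISO-8859-2 accepts the input without complaint, so a wrong guess (for example from an encoding detector) can silently produce garbled text rather than an exception.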