python python-2.7 character-encoding mechanize-python

Python writes strange characters after downloading webpage with mechanize

I have a problem with downloading webpages and processing them. What I want to do is:

use mechanize to download a webpage into a variable
print that webpage (before writing it to a file for further processing)
search the webpage for given words (this is for future research) and count how many occurrences I find.

My problem is the character encoding, because I get

<title>csonthãﾩjas termãﾩsek - wikipãﾩdia</title>

instead of

<title>csonthéjas termések - wikipédia</title>

The problem occurs with almost every accented or 'strange' character, such as áűóüő... When I simply print the same text as a string literal, it works.

print 'csonthéjas termések - wikipédia'

Chardet reports the encoding as ISO-8859-2, but nothing changes when I change my script's encoding declaration. When I try to encode or decode the webpage with any charset, I get an error ('invalid continuation byte' or 'ordinal not in range(128)').

I tried many encodings and different browser agents, and I tried detecting the encoding with chardet and then using that information, but nothing solved my problem. I know this is a simple question, but I could not find the correct answer for it. I use Windows 8.1 and Python 2.7.6.

My code is the following (I tried to cut it down as much as I could):

#!/usr/bin/python
# -*- coding: ISO-8859-2 -*-

def url_get(url_input): #Get the webpage
    "Get the webpage"
    import mechanize
    url = url_input
    br = mechanize.Browser()
    br.set_handle_equiv(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    #User-agent','Mozilla/1.22 (compatible; MSIE 10.0; Windows 3.1)
    br.addheaders = [('user-agent', '   Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.3) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3'),
('accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')]
    result = br.open(url).read().lower()
    print result

    import chardet    
    rawdata = result
    detection = chardet.detect(rawdata)
    charenc = detection['encoding']
    print charenc

    return result

text = url_get('http://hu.wikipedia.org/wiki/Csonth%C3%A9jas_term%C3%A9sek')

print 'csonthéjas termések - wikipédia'

Solution

The page appears to be encoded in UTF-8. Decode the byte string you downloaded and print the result: print text.decode('utf-8'). This works for me when I read the page content using the requests module.
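As a small illustration of why the choice of codec matters, the sketch below decodes the same bytes twice. The byte string is a hypothetical stand-in for what br.open(url).read() returns; it is built from a literal here so the example runs without network access.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the downloaded page: UTF-8 encoded bytes.
raw = u'<title>Csonth\u00e9jas term\u00e9sek - Wikip\u00e9dia</title>'.encode('utf-8')

right = raw.decode('utf-8')       # the codec the bytes were encoded with
wrong = raw.decode('iso-8859-2')  # a different codec: no error, but garbled accents

print(repr(right))
print(repr(wrong))
```

Decoding with the wrong single-byte codec does not raise an error; it silently turns each two-byte UTF-8 sequence into two unrelated characters.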

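You need to remove the lower() call that runs before decoding. In Python 2, lower() on a byte string works byte by byte and, depending on the locale, can change bytes above 0x7F. That breaks multi-byte UTF-8 sequences, which would explain the 'invalid continuation byte' error. If you want lowercase text, call lower() after you decode.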
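The effect can be reproduced with a minimal sketch. It assumes the byte-wise lowercasing behaves like Latin-1 case mapping, which is simulated here by a decode/lower/encode round trip through 'latin-1'; the actual mapping on a given machine depends on its locale. The sample bytes are again a hypothetical stand-in for the downloaded page.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the downloaded page: UTF-8 encoded bytes.
raw = u'<title>Csonth\u00e9jas term\u00e9sek</title>'.encode('utf-8')

# Simulate a byte-wise lower() that treats every byte as a Latin-1 character.
# The UTF-8 lead byte 0xC3 looks like an uppercase letter and becomes 0xE3.
mangled = raw.decode('latin-1').lower().encode('latin-1')

try:
    mangled.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False  # the lowered bytes are no longer valid UTF-8

# Safe order: decode first, then lowercase the Unicode text.
safe = raw.decode('utf-8').lower()

print(decoded_ok)
print(repr(safe))
```

Lowercasing after decoding operates on characters rather than bytes, so the accented letters survive.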

The # -*- coding line only declares the encoding of your script file, that is, how Python reads the source file itself. It has no effect on data that your script reads at run time. To deal with text data in different encodings, you need to decode the data after you read it in.