python encoding rss codec rss-reader

Problems with encoding Website in Python. Getting 'charmap' codec can't encode character '\x9f' in position


I want to build an RSS feed reader myself, so I got started.

The test page I get my feed from is 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'.

It is a German page, so I chose "iso-8859-1" for decoding. Here is the code:

def main():
    counter = 0
    try:
        page = 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'
        sourceCode = opener.open(page).read().decode('iso-8859-1')
    except Exception as e:
        print(str(e))
        #print sourceCode
    try:
        titles = re.findall(r'<title>(.*?)</title>',sourceCode)
        links = re.findall(r'<link>(.*?)</link>',sourceCode)
    except Exception as e:
        print(str(e))
    rssFeeds = []
    for link in links:
        if "rss." in link:
            rssFeeds.append(link)
    for feed in rssFeeds:
        if ('html' in feed) or ('htm' in feed):
            try:
                print("Besuche " + feed+ ":")
                feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
            except Exception as e:
                print(str(e))
            content = re.findall(r'<p>(.*?)</p>', feedSource)
            try:
                tempTxt = open("feed" + str(counter)+".txt", "w")
                for line in content:
                    tempTxt.write(tagFilter(line))
            except Exception as e:
                print(str(e))
            finally:
                tempTxt.close()
                counter += 1
                time.sleep(10)

  1. First, I open the website mentioned above. So far there seems to be no problem opening it.
  2. After decoding the page, I search it for everything inside link tags.
  3. Then I select the links that contain "rss" and store them in a new list.
  4. With the new list, I open each link and search it for its content.

This is where the problems start. I decode those pages, which are also German, and get errors like:

  • 'charmap' codec can't encode character '\x9f' in position 339: character maps to <undefined>
  • 'charmap' codec can't encode character '\x9c' in position 43: character maps to <undefined>
  • 'charmap' codec can't encode character '\x80' in position 131: character maps to <undefined>

I really have no idea why it doesn't work. The data collected before the error appears is written to a text file.

Example for collected data:

Einloggen auf heise onlineTopthemen:Nachdem Google Anfang des Monats eine 64-Bit-Beta seines hauseigenen Browsers Chrome für Windows 7 und Windows 8 vorgestellt hatte, kümmert sich der Internetriese nun auch um OS X. Wie Tester melden, verbreitet Google über seine Canary-/Dev-Kanäle für Entwickler und Early Adopter nun automatisch 64-Bit-Builds, wenn der User über einen kompatiblen Rechner verfügt.

I hope someone can help me. Other hints or information that would help me build my own RSS feed reader are also welcome.

Greetings Templum


Solution

  • Per miko's and Wooble's comments:

    iso-8859-1 should be utf-8 since the XML returned says the encoding is utf-8:

    In [71]: sourceCode = opener.open(page).read()
    
    In [72]: sourceCode[:100]
    Out[72]: "<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet type='text/xsl' href='http://heise.de.feedspo"
    

    and you really ought to use an XML parser such as lxml or BeautifulSoup to parse XML. Relying on the re module alone is more error-prone.
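As a rough illustration of the parser approach, here is a minimal sketch that uses xml.etree.ElementTree from the standard library in place of lxml or BeautifulSoup (a substitution made only to keep the example free of dependencies). The sample feed and the helper name parse_feed_items are made up for illustration. Passing the raw bytes to the parser lets it honor the encoding declared in the XML header, so no manual decode call is needed.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; a real one would come from opener.open(page).read().
SAMPLE_FEED = (
    "<?xml version='1.0' encoding='UTF-8'?>"
    "<rss version='2.0'><channel>"
    "<title>heise online News</title>"
    "<item><title>Größe zählt</title>"
    "<link>http://example.com/rss.artikel-1.html</link></item>"
    "<item><title>Zweiter Artikel</title>"
    "<link>http://example.com/rss.artikel-2.html</link></item>"
    "</channel></rss>"
).encode("utf-8")


def parse_feed_items(raw_bytes):
    """Return a list of (title, link) tuples, one per <item> in an RSS feed."""
    # Bytes in: the parser reads the encoding from the XML declaration.
    root = ET.fromstring(raw_bytes)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


items = parse_feed_items(SAMPLE_FEED)
```

Unlike the regular expressions in the question, this picks up only the titles and links of the feed items and not the channel-level title.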


    feedSource is a unicode string, since it is the result of decoding:

            feedSource = opener.open(feed).read().decode("utf-8","replace")
    

    Therefore, each line is also a unicode string:

        content = re.findall(r'<p>(.*?)</p>', feedSource)
        for line in content:
            ...
    

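    tempTxt is a plain file handle from the built-in open, with no encoding specified (as opposed to one opened with io.open, which takes an encoding parameter). In Python 2 such a handle expects bytes (e.g. a str), not unicode. In Python 3 it is a text-mode handle that encodes with the platform default, which on Windows is typically a 'charmap' codec such as cp1252, and that is where the errors in the question are raised.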
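For context on the error messages in the question, the following sketch shows how a character like '\x9f' can appear in the first place and why writing it fails. It assumes cp1252 as the default file encoding (common on Windows, but an assumption here), and the sample string and the helper try_encode are made up for illustration.

```python
# A few characters common in German text; their UTF-8 forms are multi-byte.
original = "Straße “Über”"
utf8_bytes = original.encode("utf-8")

# Decoding with iso-8859-1 never fails (every byte maps to a code point),
# but each multi-byte character turns into two or three Latin-1 characters,
# including C1 control characters such as '\x9f', '\x9c' and '\x80'.
wrong = utf8_bytes.decode("iso-8859-1")
right = utf8_bytes.decode("utf-8")


def try_encode(text, codec):
    """Return None if text encodes with codec, else the error message."""
    try:
        text.encode(codec)
        return None
    except UnicodeEncodeError as exc:
        return str(exc)


# cp1252 is assumed here as the platform default used by a plain open().
error = try_encode(wrong, "cp1252")
```

Here error holds a message of the same kind as in the question, complaining about '\x9f', while the correctly decoded text in right encodes without trouble.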

    So encode the line before writing to the file (in Python 3 this also requires opening the file in binary mode, "wb"):

            for line in content:
                tempTxt.write(line.encode('utf-8'))
    

    or define tempTxt using io.open and specify an encoding:

    import io
    with io.open(filename, "w", encoding='utf-8') as tempTxt:
        for line in content:
            tempTxt.write(line)
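A small round-trip check of this approach might look as follows. It is only a sketch: the sample lines are made up, and a temporary directory is used so nothing is left behind; the file name "feed0.txt" simply mirrors the naming scheme in the question.

```python
import io
import os
import tempfile

# Made-up sample content with characters outside ASCII.
lines = ["Straße\n", "“Über”\n", "64-Bit-Builds für OS X\n"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "feed0.txt")
    # An explicit encoding makes the result independent of the platform default.
    with io.open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line)
    # Read the file back with the same encoding to confirm nothing was lost.
    with io.open(path, "r", encoding="utf-8") as src:
        read_back = src.readlines()
```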
    

    By the way, it is not a good idea to catch every Exception unless you are prepared to handle every one of them:

        except Exception as e:
            print(str(e))   
    

    and furthermore, if you only print the error message, then Python may execute subsequent code even though variables defined in the try section are undefined. For example,

        try:
            print("Besuche " + feed+ ":")
            feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
        except Exception as e:
            print(str(e))   
        content = re.findall(r'<p>(.*?)</p>', feedSource)
    

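    using feedSource in the call to re.findall may raise a NameError if an exception was raised before feedSource was first defined. On a later pass through the loop it would instead silently reuse the content of the previous feed.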

    You might want to add a continue statement in the except-suite if you want Python to pass over this feed and move on to the next:

        except Exception as e:
            print(str(e))   
            continue
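Put together, the skip-and-continue pattern could be sketched like this. The fetch function is a hypothetical stand-in for opener.open(feed).read(), and the URLs are made up. Catching OSError rather than Exception is a choice made for this example; in Python 3 the URLError raised by urllib is a subclass of OSError.

```python
def fetch(url):
    """Hypothetical stand-in for opener.open(url).read(); fails for one URL."""
    if "broken" in url:
        raise OSError("could not open " + url)
    return "<p>Inhalt von {}</p>".format(url).encode("utf-8")


feeds = [
    "http://example.com/a.html",
    "http://example.com/broken.html",
    "http://example.com/b.html",
]
processed = []
skipped = []

for feed in feeds:
    try:
        feed_source = fetch(feed).decode("utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        # Only the expected failures are caught; anything else propagates.
        skipped.append((feed, str(exc)))
        continue  # feed_source was not assigned for this feed, so skip it
    # Reached only when feed_source is valid for the current feed.
    processed.append((feed, feed_source))
```

Because of the continue, the code after the try block never sees a missing or stale feed_source.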