I want to build an RSS feed reader by myself, so I got started.
My test page, from which I fetch the feed, is 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'.
It is a German page, so I chose "iso-8859-1" as the encoding for decoding. Here is the code:
def main():
    counter = 0
    try:
        page = 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'
        sourceCode = opener.open(page).read().decode('iso-8859-1')
    except Exception as e:
        print(str(e))
        #print sourceCode
    try:
        titles = re.findall(r'<title>(.*?)</title>',sourceCode)
        links = re.findall(r'<link>(.*?)</link>',sourceCode)
    except Exception as e:
        print(str(e))
    rssFeeds = []
    for link in links:
        if "rss." in link:
            rssFeeds.append(link)
    for feed in rssFeeds:
        if ('html' in feed) or ('htm' in feed):
            try:
                print("Besuche " + feed+ ":")
                feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
            except Exception as e:
                print(str(e))
            content = re.findall(r'<p>(.*?)</p>', feedSource)
            try:
                tempTxt = open("feed" + str(counter)+".txt", "w")
                for line in content:
                    tempTxt.write(tagFilter(line))
            except Exception as e:
                print(str(e))
            finally:
                tempTxt.close()
            counter += 1
            time.sleep(10)
And now the problems start. I decode those pages, which are also German pages, and I get errors like:
I really have no idea why it won't work. The data collected before the error appears gets written to a text file.
Example of collected data:
Einloggen auf heise onlineTopthemen:Nachdem Google Anfang des Monats eine 64-Bit-Beta seines hauseigenen Browsers Chrome für Windows 7 und Windows 8 vorgestellt hatte, kümmert sich der Internetriese nun auch um OS X. Wie Tester melden, verbreitet Google über seine Canary-/Dev-Kanäle für Entwickler und Early Adopter nun automatisch 64-Bit-Builds, wenn der User über einen kompatiblen Rechner verfügt.
I hope someone can help me. Other hints or information that would help me build my own RSS feed reader are also welcome.
Greetings, Templum
Per miko's and Wooble's comments: iso-8859-1 should be utf-8, since the XML returned says the encoding is utf-8:
In [71]: sourceCode = opener.open(page).read()
In [72]: sourceCode[:100]
Out[72]: "<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet type='text/xsl' href='http://heise.de.feedspo"
You also really ought to use an XML parser such as lxml or BeautifulSoup to parse XML. Relying only on the re module is more error-prone.
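As a rough sketch of the parser-based approach, the snippet below uses xml.etree.ElementTree from the standard library rather than lxml or BeautifulSoup (a swap made only so the example has no third-party dependencies). The sample feed SAMPLE_RSS and the helper extract_items are made up for illustration; in the real script the bytes would come from opener.open(page).read(). Passing the undecoded bytes lets the parser honor the encoding declared in the XML prolog, so no manual decode step is needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, stored as UTF-8 bytes the way a server would send it.
SAMPLE_RSS = u"""<?xml version='1.0' encoding='UTF-8'?>
<rss version="2.0">
  <channel>
    <title>Beispiel</title>
    <link>http://example.com/</link>
    <item>
      <title>Chrome f\u00fcr OS X</title>
      <link>http://example.com/rss.1.html</link>
    </item>
    <item>
      <title>Zweite Meldung</title>
      <link>http://example.com/rss.2.html</link>
    </item>
  </channel>
</rss>
""".encode("utf-8")


def extract_items(raw_bytes):
    """Return a list of (title, link) pairs, one per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(raw_bytes)  # the parser reads the declared encoding itself
    items = []
    for item in root.iter("item"):
        items.append((item.findtext("title"), item.findtext("link")))
    return items


if __name__ == "__main__":
    for title, link in extract_items(SAMPLE_RSS):
        print(link)
```

Because only <item> elements are visited, the channel-level <title> and <link> are skipped automatically, which the regular expressions in the question cannot distinguish.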
feedSource is a unicode since it is the result of a decoding:
feedSource = opener.open(feed).read().decode("utf-8","replace")
Therefore, line is also unicode:
content = re.findall(r'<p>(.*?)</p>', feedSource)
for line in content:
    ...
tempTxt is a plain file handle (as opposed to one opened with io.open, which takes an encoding parameter). So tempTxt expects bytes (e.g. a str), not unicode.
So encode the line before writing to the file:
for line in content:
    tempTxt.write(line.encode('utf-8'))
or define tempTxt using io.open and specify an encoding:
import io
with io.open(filename, "w", encoding='utf-8') as tempTxt:
    for line in content:
        tempTxt.write(line)
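To check that the io.open variant stores the text intact, here is a small round-trip sketch. The helper names write_lines and read_raw and the temporary file are illustrative assumptions, not part of the original script.

```python
import io
import os
import tempfile


def write_lines(filename, lines):
    """Write unicode strings to filename, letting io.open encode them as UTF-8."""
    with io.open(filename, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line)


def read_raw(filename):
    """Return the raw bytes stored in filename."""
    with io.open(filename, "rb") as handle:
        return handle.read()


if __name__ == "__main__":
    content = [u"f\u00fcr Entwickler", u" verf\u00fcgt."]
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        write_lines(path, content)
        # Compare the bytes on disk with the UTF-8 encoding of the joined text.
        print(read_raw(path) == u"".join(content).encode("utf-8"))
    finally:
        os.remove(path)
```

The handle accepts unicode directly and does the encoding on write, so no explicit .encode call is needed in the loop.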
By the way, it's not good to catch all Exceptions unless you are ready to handle all Exceptions:
except Exception as e:
    print(str(e))
Furthermore, if you only print the error message, then Python may execute subsequent code even though variables defined in the try section are undefined. For example,
try:
    print("Besuche " + feed+ ":")
    feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
except Exception as e:
    print(str(e))
content = re.findall(r'<p>(.*?)</p>', feedSource)
using feedSource in the call to re.findall may raise a NameError if an exception was raised before feedSource was defined.
You might want to add a continue statement in the except-suite if you want Python to pass over this feed and move on to the next:
except Exception as e:
    print(str(e))
    continue
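Put together, the loop might look like the following sketch. Here fetch is a hypothetical stand-in for opener.open(feed).read(), made to fail for one URL so the effect of continue is visible. Catching IOError instead of Exception is likewise an assumption about what a failed download raises in your setup.

```python
import re


def fetch(url):
    """Hypothetical stand-in for opener.open(url).read(); fails for one URL."""
    if "broken" in url:
        raise IOError("could not open " + url)
    return b"<p>Hallo</p><p>Welt</p>"


def collect_paragraphs(feeds):
    """Return {url: [paragraph, ...]} for every feed that could be fetched."""
    results = {}
    for feed in feeds:
        try:
            feed_source = fetch(feed).decode("utf-8", "replace")
        except IOError as e:
            print(str(e))
            continue  # skip this feed; feed_source was never assigned
        # Only reached when the try block succeeded, so feed_source exists.
        results[feed] = re.findall(r"<p>(.*?)</p>", feed_source)
    return results


if __name__ == "__main__":
    print(collect_paragraphs(["http://example.com/ok.html",
                              "http://example.com/broken.html"]))
```

The failing feed is reported and skipped, and the code after the try block only ever sees a defined feed_source.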