Tags: python, python-2.7, rss, urllib2, urllib

Web scraping using urllib2


I am trying to scrape all the titles from this RSS feed:

http://www.quora.com/Python-programming-language-1/rss

This is my code:

import urllib2
import re
content = urllib2.urlopen('http://www.quora.com/Python-programming-language-1/rss').read()
allTitles =  re.compile('<title>(.*)</title>')
list = re.findall(allTitles,content)
for e in range(0, 2):
    print list[e]

However, instead of a list of titles, the output is a large chunk of the raw RSS source. What am I doing wrong?


Solution

  • Use the non-greedy qualifier (?) after the * in the expression:

    #allTitles =  re.compile('<title>(.*)</title>')
    allTitles =  re.compile('<title>(.*?)</title>')
    

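    Without the ?, .* is greedy and matches as much as it can, so the (.*) group captures everything from the first <title> up to the last </title> on the same line, including all the markup in between. With .*? the match stops at the first </title> it reaches, which gives one result per title.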
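
To see the difference without fetching the feed, here is a minimal sketch that runs both patterns against a short made-up string. The `sample` text is an assumption standing in for the real RSS source, not actual content from the Quora feed.

```python
import re

# Hypothetical one-line fragment standing in for the real RSS source.
sample = "<title>First post</title><link>a</link><title>Second post</title>"

# Greedy: .* runs on to the last </title> on the line.
greedy = re.findall(r"<title>(.*)</title>", sample)

# Non-greedy: .*? stops at the first </title> it reaches.
lazy = re.findall(r"<title>(.*?)</title>", sample)

print(greedy)  # a single long match with markup inside it
print(lazy)    # one entry per <title> element
```

The greedy pattern returns a single item containing the markup between the two titles, while the non-greedy one returns the two titles separately. The same change should apply to the `content` string read with urllib2 in the question.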