Tags: python, list, python-2.7, web-scraping, beautifulsoup

Saving results to list from a for loop?


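import urllib2
from bs4 import BeautifulSoup

url = 'http://www.millercenter.org/president/speeches'

# Download the page that lists the speeches.
conn = urllib2.urlopen(url)
html = conn.read()

# Parse the HTML (naming the parser explicitly) and collect every anchor tag.
miller_center_soup = BeautifulSoup(html, 'html.parser')
links = miller_center_soup.find_all('a')

# Print the href of each link, skipping anchors that have none.
for tag in links:
    link = tag.get('href', None)
    if link is not None:
        print link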

Here's some of my output:

/president/washington/speeches/speech-3939
/president/washington/speeches/speech-3939
/president/washington/speeches/speech-3461
https://www.facebook.com/millercenter
https://twitter.com/miller_center
https://www.flickr.com/photos/miller_center
https://www.youtube.com/user/MCamericanpresident
http://forms.hoosonline.virginia.edu/s/1535/16-uva/index.aspx?sid=1535&gid=16&pgid=9982&cid=17637
mailto:[email protected]

I'm trying to scrape all the presidential speeches on millercenter.org/president/speeches, but I'm having difficulty saving only the speech links from which I'll scrape the speech data. For example, if I need George Washington's speech at http://www.millercenter.org/president/washington/speeches/speech-3461, I need to be able to pick out that URL and nothing else. I'm thinking of storing the URLs for all the speeches in a list, and then writing a for loop to scrape and clean the data.


Solution

  • Convert it to a list comprehension:

    linklist = [tag.get('href') for tag in links if tag.get('href') is not None]
    

    Slightly optimized, so that tag.get('href') is called only once per tag:

    linklist = [href for href in (tag.get('href') for tag in links) if href is not None]
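
The comprehensions above keep every href on the page, including the social-media and mailto links. To keep only the speech pages, one option is to filter the list with a regular expression and turn the relative paths into full URLs. This is a minimal sketch: the `SPEECH_PATH` pattern is an assumption inferred from the sample output in the question, and `speech_urls` is a hypothetical helper name, not part of BeautifulSoup. It works on plain href strings, so it can be fed `linklist` directly.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

BASE_URL = 'http://www.millercenter.org/president/speeches'

# Assumed shape of a speech link, inferred from the sample output:
# /president/<name>/speeches/speech-<digits>
SPEECH_PATH = re.compile(r'^/president/[^/]+/speeches/speech-\d+$')


def speech_urls(hrefs, base_url=BASE_URL):
    """Return absolute speech URLs, in order, without duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        # Skip missing hrefs and anything that is not a speech path.
        if href is None or not SPEECH_PATH.match(href):
            continue
        full = urljoin(base_url, href)
        if full not in seen:
            seen.add(full)
            urls.append(full)
    return urls


# A few hrefs taken from the output shown in the question.
sample = [
    '/president/washington/speeches/speech-3939',
    '/president/washington/speeches/speech-3939',
    '/president/washington/speeches/speech-3461',
    'https://www.facebook.com/millercenter',
    None,
]
result = speech_urls(sample)
```

With the scraped page, `speech_urls(linklist)` would give the list to loop over when downloading each speech. The duplicate check is there because the page can link to the same speech more than once, as the first two lines of the sample output show.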