Tags: python, web-scraping, beautifulsoup

How to use BeautifulSoup to scrape links in HTML


I need to download a few links from an HTML page. But I don't need all of them; I only need a few from a certain section of the page. For example, on http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning, I need the links in the Debaters section. I plan to use BeautifulSoup, and I looked at the HTML of one of the links:

<a href="/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning/student-data-collection-is-out-of-control" class="bl-bigger">Data Collection Is Out of Control</a>

Here's my code:

import requests
from bs4 import BeautifulSoup

r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
link_set = set()
for link in soup.find_all("a", class_="bl-bigger"):
    href = link.get('href')
    if href is None:
        continue
    elif '/roomfordebate/' in href:
        link_set.add(href)
for link in link_set:
    print(link)

This code is supposed to give me all the links with the bl-bigger class, but it actually returns nothing. Could anyone figure out what's wrong with my code or how to make it work? Thanks.


Solution

  • I don't see the bl-bigger class at all when I view the source from Chrome. Maybe that's why your code isn't working?
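
    A quick way to confirm this (a minimal sketch; it just checks whether the class name appears anywhere in the HTML the server actually returns, assuming url is the article URL):

    import requests

    url = "http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning"
    html = requests.get(url).text
    # If this prints False, the server-rendered HTML never contains the
    # class, so soup.find_all("a", class_="bl-bigger") will match nothing.
    print('bl-bigger' in html)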

    Let's start by looking at the source. The whole Debaters section seems to sit inside a div with the class nytint-discussion-content. So using BeautifulSoup, let's get that whole div first.

    debaters_div = soup.find('div', class_="nytint-discussion-content")
    

    Again, going by the source, all the links appear to sit inside a list, in li tags. Now all you have to do is find all the li tags and look for the anchor tags within them. One more thing you can notice is that all the li tags have the class nytint-bylines-1.

    list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
    list_items[0].find('a')
    # <a href="/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning/student-data-collection-is-out-of-control">Data Collection Is Out of Control</a>
    

    So, your whole code can be:

    import requests
    from bs4 import BeautifulSoup

    link_set = set()
    response = requests.get(url)
    html_data = response.text
    soup = BeautifulSoup(html_data, 'html.parser')  # name a parser explicitly
    debaters_div = soup.find('div', class_="nytint-discussion-content")
    list_items = debaters_div.find_all("li", class_="nytint-bylines-1")

    for each_item in list_items:
        html_link = each_item.find('a').get('href')
        if html_link.startswith('/roomfordebate'):
            link_set.add(html_link)
    

    Now link_set will contain all the links you want. From the link given in the question, it will fetch 5 links.
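
    For example, to inspect what was collected (a small usage sketch, not part of the original answer):

    for link in sorted(link_set):
        print(link)
    print(len(link_set))  # 5 for the page in the question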

    PS: link_set contains only relative URIs, not full URLs. So I would prepend http://www.nytimes.com before adding the links to link_set. Just change the last line to:

    link_set.add('http://www.nytimes.com' + html_link)
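
    Alternatively, urljoin from the standard library resolves a relative path against a base URL and is a bit more robust than string concatenation (a sketch, not from the original answer):

    from urllib.parse import urljoin

    # Resolves '/roomfordebate/...' against the site root; it also copes
    # correctly if an href happens to be an absolute URL already.
    link_set.add(urljoin('http://www.nytimes.com', html_link))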