Creating a unique list of href attributes with Beautiful Soup in Python


I am trying to create a unique list of all the hrefs on my anchor tags.

from urllib2 import urlopen

from bs4 import BeautifulSoup

import pprint

url = 'http://barrowslandscaping.com/'

soup = BeautifulSoup(urlopen(url), "html.parser")
print soup

tag = soup.find_all('a', {"href": True})
set(tag)
for tags in tag:
    print tags.get('href')

result:

http://barrowslandscaping.com/
http://barrowslandscaping.com/services/
http://barrowslandscaping.com/design-consultation/
http://barrowslandscaping.com/hydroseeding-sodding/
http://barrowslandscaping.com/landscape-installation/
http://barrowslandscaping.com/full-service-maintenance/
http://barrowslandscaping.com/portfolio/
http://barrowslandscaping.com/about-us/
http://barrowslandscaping.com/contact/
http://barrowslandscaping.com/design-consultation/
http://barrowslandscaping.com/full-service-maintenance/

I have tried moving the set(tag) call into the for loop, but that didn't change my results.


Solution

  • First, set() does not work in place: it is a conversion that returns a new set, so you have to assign the result to a variable.

    tag_set = set(tags)
    

    Second, set deduplicates the Tag objects themselves, not their href values. Two separate anchor tags found in the HTML can be distinct Tag objects even when they point to the same URL, so both remain in the set. set has no idea that they share the same href value.

    Instead, you should first extract the href values into a list and then convert that list to a set.

    tags = soup.find_all('a', {"href": True})
    # extract the href values into a new list using a list comprehension
    hrefs = [tag.get('href') for tag in tags]
    href_set = set(hrefs)
    
    for href in href_set:
        print href
    

    This can be further simplified using a set comprehension:

    tags = soup.find_all('a', {"href": True})
    href_set = {tag.get('href') for tag in tags}
    
    for href in href_set:
        print href
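
As a rough illustration of the two points above, here is a minimal sketch that runs without a network request. It is written in Python 3 syntax, unlike the Python 2 snippets above, and the Link class is a hypothetical stand-in for a parsed anchor tag, not a BeautifulSoup class. The last step, which keeps the first-seen order with dict.fromkeys, is an optional extra that the answer itself does not cover; it relies on dicts preserving insertion order (Python 3.7+).

```python
class Link:
    """Hypothetical stand-in for a parsed <a> element (not part of BeautifulSoup)."""

    def __init__(self, href, text):
        self.href = href
        self.text = text


# Three anchors; two of them point to the same URL.
links = [
    Link("http://barrowslandscaping.com/design-consultation/", "menu entry"),
    Link("http://barrowslandscaping.com/portfolio/", "menu entry"),
    Link("http://barrowslandscaping.com/design-consultation/", "footer entry"),
]

# set() returns a new set; calling it without keeping the result changes nothing.
set(links)
assert len(links) == 3

# A set of the objects keeps all three, because they are distinct objects
# even though two of them share an href.
link_set = set(links)

# A set of the href strings collapses the duplicate URL.
href_set = {link.href for link in links}

# Sets are unordered. To keep the order in which the URLs first appeared,
# dict.fromkeys can be used instead (insertion-ordered in Python 3.7+).
ordered_hrefs = list(dict.fromkeys(link.href for link in links))

print(len(link_set), len(href_set))
for href in ordered_hrefs:
    print(href)
```

With real BeautifulSoup results, the same idea applies by replacing link.href with tag.get('href') as in the answer.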