I am trying to create a unique list of all the hrefs on my anchor tags:
from urllib2 import urlopen
from bs4 import BeautifulSoup
import pprint
url = 'http://barrowslandscaping.com/'
soup = BeautifulSoup(urlopen(url), "html.parser")
print soup
tag = soup.find_all('a', {"href": True})
set(tag)
for tags in tag:
    print tags.get('href')
result:
http://barrowslandscaping.com/
http://barrowslandscaping.com/services/
http://barrowslandscaping.com/design-consultation/
http://barrowslandscaping.com/hydroseeding-sodding/
http://barrowslandscaping.com/landscape-installation/
http://barrowslandscaping.com/full-service-maintenance/
http://barrowslandscaping.com/portfolio/
http://barrowslandscaping.com/about-us/
http://barrowslandscaping.com/contact/
http://barrowslandscaping.com/design-consultation/
http://barrowslandscaping.com/full-service-maintenance/
I have tried moving the set(tag) call into the for loop, but that didn't change my results.
First, you can't call set() in place; it's a conversion that returns a new value, so you have to assign the result:
tag_set = set(tags)
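To make that first point concrete, here is a minimal sketch using a plain list of strings. The sample values are made up for illustration and are not taken from the page; print is called with parentheses so the snippet runs on both Python 2 and Python 3.

```python
# Hypothetical sample data standing in for extracted href values.
hrefs = ["/services/", "/contact/", "/services/"]

set(hrefs)  # builds a set, but the result is thrown away immediately

# The original list is untouched and still contains the duplicate.
print(len(hrefs))

# Assigning the return value keeps the de-duplicated collection.
unique_hrefs = set(hrefs)
print(len(unique_hrefs))
```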
Second, set doesn't know what makes two BeautifulSoup Tag objects duplicates for your purposes. As far as it's concerned, two separate tags were found in the HTML, so they are distinct items and both remain in the set. It has no idea that you only care about their href value.
Instead, first extract the href attributes into a list, then convert that list to a set.
tags = soup.find_all('a', {"href": True})
# extract the href values into a new list using a list comprehension
hrefs = [tag.get('href') for tag in tags]
href_set = set(hrefs)
for href in href_set:
    print href
This can be further simplified using a set comprehension:
tags = soup.find_all('a', {"href": True})
href_set = {tag.get('href') for tag in tags}
for href in href_set:
    print href
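One thing to keep in mind is that a set does not keep the links in page order. If the order matters to you, a rough sketch of an order-preserving alternative is below. The helper name unique_in_order is my own and not part of BeautifulSoup, and the sample list is copied from the output in the question so the snippet runs without fetching the page.

```python
def unique_in_order(items):
    """Return items with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)       # remember the value for later lookups
            result.append(item)  # keep only the first occurrence
    return result


# Sample values copied from the output shown in the question.
hrefs = [
    "http://barrowslandscaping.com/services/",
    "http://barrowslandscaping.com/design-consultation/",
    "http://barrowslandscaping.com/contact/",
    "http://barrowslandscaping.com/design-consultation/",
]

for href in unique_in_order(hrefs):
    print(href)
```

With the real page you would pass in the hrefs list built from soup.find_all instead of the hard-coded sample.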