I am trying to collect the "a" tags inside elements with class="featured" from the site http://www.pakistanfashionmagazine.com. I wrote the code below; it runs without errors, but the links come out duplicated. How can I avoid this duplication?
from bs4 import BeautifulSoup
import requests
url = raw_input("Enter a website to extract the URL's from: ")
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
results = soup.findAll('div', attrs={"class": 'featured'})
for div in results:
    links = div.findAll('a')
    for a in links:
        # the hrefs already start with "/", so the base URL has no trailing slash
        print "http://www.pakistanfashionmagazine.com" + a['href']
The actual HTML page has two links per item <div>; one for the image, the other for the <h4> tag:
<div class="item">
    <div class="image">
        <a href="/dress/casual-dresses/bella-embroidered-lawn-collection-3-stitched-suits-pkr-14000-only.html" title="BELLA Embroidered Lawn Collection*3 STITCHED SUITS@PKR 14000 ONLY"><img src="/siteimages/upload/BELLA-Embroidered-Lawn-Collection3-STITCHED-SUITSPKR-14000-ONLY_1529IM1-thumb.jpg" alt="Featured Product" /></a>
    </div>
    <div class="detail">
        <h4><a href="/dress/casual-dresses/bella-embroidered-lawn-collection-3-stitched-suits-pkr-14000-only.html">BELLA Embroidered Lawn Collection*3 STITCHED SUITS@PKR 14000 ONLY</a></h4>
        <em>updated: 2013-06-03</em>
        <p>BELLA Embroidered Lawn Collection*3 STITCHED SUITS@PKR 14000 ONLY</p>
    </div>
</div>
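To confirm that both anchors in an item point at the same URL, you can count the hrefs. This is only a rough sketch in Python 3 using the standard-library `html.parser` instead of BeautifulSoup; the `HrefCollector` class, the `ITEM_HTML` snippet and its paths are made up for illustration and are not taken from the site.

```python
from collections import Counter
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# Shortened, hypothetical stand-in for one item <div> of the page.
ITEM_HTML = """
<div class="item">
  <div class="image"><a href="/dress/example.html"><img src="/thumb.jpg" alt="Featured Product" /></a></div>
  <div class="detail"><h4><a href="/dress/example.html">Example</a></h4></div>
</div>
"""

collector = HrefCollector()
collector.feed(ITEM_HTML)

# Any href with a count above 1 is a link that would be printed more than once.
counts = Counter(collector.hrefs)
print(counts)
```

Every href showing up with a count of 2 would match the image link plus the heading link described above.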
Limit your links to just one or the other; I'd use CSS selectors here:
links = soup.select('div.featured .detail a[href]')
for link in links:
    # the hrefs already start with "/", so the base URL has no trailing slash
    print "http://www.pakistanfashionmagazine.com" + link['href']
Now 32 links are printed, not 64.
If you need to limit this to just the second featured section (Beauty Tips), select the featured divs, pick the second one from the list, then select the detail links within it:
links = soup.select('div.featured')[1].select('.detail a[href]')
Now you have just the 8 links in that section.
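If you would rather keep the broad selection and drop the repeats afterwards, another option is to de-duplicate the hrefs while preserving their order. The following is a minimal sketch in Python 3, not part of the approach above; `unique_absolute_urls` and the sample hrefs are hypothetical names and values chosen for illustration. It also uses `urljoin` from the standard library so that a leading "/" in the href does not produce a double slash.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.pakistanfashionmagazine.com/"


def unique_absolute_urls(hrefs, base=BASE_URL):
    """Return absolute URLs for hrefs, dropping repeats but keeping order."""
    seen = set()
    unique = []
    for href in hrefs:
        absolute = urljoin(base, href)
        if absolute not in seen:
            seen.add(absolute)
            unique.append(absolute)
    return unique


# Hypothetical hrefs, as they might come out of a broad 'div.featured a' selection.
sample_hrefs = [
    "/dress/example-a.html",
    "/dress/example-a.html",
    "/beauty/example-b.html",
]

for url in unique_absolute_urls(sample_hrefs):
    print(url)
```

With BeautifulSoup you could feed it something like `[a['href'] for a in soup.select('div.featured a[href]')]`. Narrowing the selector is still the simpler fix when the page structure is as regular as it is here.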