python html web-scraping beautifulsoup

Getting Duplicate links in Scraping


I am trying to collect the "a" tags inside the class="featured" elements of the site http://www.pakistanfashionmagazine.com. I wrote this piece of code; it runs without errors, but the output contains duplicate links. How can I avoid this duplication?

from bs4 import BeautifulSoup
import requests

url = raw_input("Enter a website to extract the URLs from: ")

r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

results = soup.findAll('div', attrs={"class": 'featured'})

for div in results:
    links = div.findAll('a')
    # Nested so that the links of every featured div are printed
    for a in links:
        print "http://www.pakistanfashionmagazine.com/" + a['href']

Solution

  • Each item <div> on the actual HTML page contains two links to the same URL: one wrapping the image, the other inside the <h4> tag:

    <div class="item">
    
        <div class="image">
            <a href="/dress/casual-dresses/bella-embroidered-lawn-collection-3-stitched-suits-pkr-14000-only.html" title="BELLA Embroidered Lawn Collection*3 STITCHED SUITS@PKR 14000 ONLY"><img src="/siteimages/upload/BELLA-Embroidered-Lawn-Collection3-STITCHED-SUITSPKR-14000-ONLY_1529IM1-thumb.jpg" alt="Featured Product" /></a>                    </div>
    
      <div class="detail">
            <h4><a href="/dress/casual-dresses/bella-embroidered-lawn-collection-3-stitched-suits-pkr-14000-only.html">BELLA Embroidered Lawn Collection*3 STITCHED SUITS@PKR 14000 ONLY</a></h4>
                                    <em>updated: 2013-06-03</em>
            <p>BELLA Embroidered Lawn Collection*3 STITCHED SUITS@PKR 14000 ONLY</p>
    
        </div>
    </div>
    

    Limit your links to just one or the other; I'd use CSS selectors here:

    links = soup.select('div.featured .detail a[href]')
    for link in links:
        print "http://www.pakistanfashionmagazine.com" + link['href']
    

    Now 32 links are printed, not 64.

    If you need to limit this to just the second featured section (Beauty Tips), select the featured divs, pick the second one from the list, then select the detail links within it:

    links = soup.select('div.featured')[1].select('.detail a[href]')
    

    Now you have just the 8 links in that section.
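
For anyone who wants to try the idea without hitting the live site, here is a minimal sketch in Python 3 (the snippets above are Python 2). The sample markup is a cut-down, made-up version of the item structure shown above; the `/dress/a.html` style paths, `BASE_URL`, and the variable names are placeholders, not taken from the real page. It puts the narrowed selector next to an alternative that keeps the broad selector and drops repeated hrefs afterwards, and it uses `urljoin` to build absolute URLs so the leading slash in each href is handled for you.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL and hypothetical, trimmed-down markup modelled on the
# item structure shown in the answer; not copied from the real page.
BASE_URL = "http://www.pakistanfashionmagazine.com/"
SAMPLE_HTML = """
<div class="featured">
  <div class="item">
    <div class="image"><a href="/dress/a.html"><img src="/a.jpg" alt=""/></a></div>
    <div class="detail"><h4><a href="/dress/a.html">Item A</a></h4></div>
  </div>
  <div class="item">
    <div class="image"><a href="/dress/b.html"><img src="/b.jpg" alt=""/></a></div>
    <div class="detail"><h4><a href="/dress/b.html">Item B</a></h4></div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every <a> in the featured block: each item appears twice (image + title).
all_hrefs = [a["href"] for a in soup.select("div.featured a[href]")]

# Option 1: narrow the selector so only the title link of each item matches.
detail_hrefs = [a["href"] for a in soup.select("div.featured .detail a[href]")]

# Option 2: keep the broad selector and drop repeats, preserving first-seen order.
unique_hrefs = list(dict.fromkeys(all_hrefs))

# urljoin resolves the root-relative href against the base URL.
absolute_urls = [urljoin(BASE_URL, href) for href in unique_hrefs]

for url in absolute_urls:
    print(url)
```

Narrowing the selector is the tidier fix when you know the page structure; de-duplicating afterwards is more forgiving if the markup changes, though it hides why the repeats appeared in the first place.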