python html beautifulsoup tags scraper

BeautifulSoup: extract text from anchor tag


I want to extract:

  • the src attribute of the image tag, and
  • the text of the anchor tag inside the div with class data

I managed to extract the img src, but I am having trouble extracting the text from the anchor tag.

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

Here is the link for the entire HTML page.

Here is my code:

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

What I am trying to do is extract the image src (link) and the title inside the div with class data. For example:

 <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

should extract:

Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)


Solution

  • All of the answers above helped me construct my own, so I upvoted each of them. In the end, though, I put together my own answer to the exact problem I was dealing with:

    As the question describes, I had to reach a sibling of each image div and that sibling's children in the DOM structure. This solution iterates over the images in the DOM, builds each file name from the product title, and saves the image to a local directory.

    import os
    from urllib import urlretrieve
    from BeautifulSoup import BeautifulSoup as bs
    import requests

    def getImages(url):
        # Fetch the search results page and parse it
        r = requests.get(url)
        html = r.text
        soup = bs(html)
        # Expand '~' so the path points at the user's home directory
        output_folder = os.path.expanduser('~/amazon')
        if not os.path.isdir(output_folder):
            os.makedirs(output_folder)
        # Each product image sits in a div with class "image"
        for div in soup.findAll('div', attrs={'class':'image'}):
            # Get the data div that follows the image div using findNext
            nextDiv = div.findNext('div', attrs={'class':'data'})
            # Use findNext again on that div to get to the anchor tag
            anchor = nextDiv.findNext('a') if nextDiv is not None else None
            img = div.find('img')
            if anchor is None or img is None:
                # Without a title or an image there is nothing to save
                print 'skip'
                continue
            fileName = anchor.text
            # Replace spaces and path separators so the title is a valid file name
            modified_file_name = fileName.replace(' ','-').replace('/','-') + '.jpg'
            imageUrl = img['src']
            outputPath = os.path.join(output_folder, modified_file_name)
            urlretrieve(imageUrl, outputPath)

    if __name__=='__main__':
        url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585'
        getImages(url)
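
The code above targets Python 2 and BeautifulSoup 3. As a minimal sketch for Python 3, assuming the bs4 package (BeautifulSoup 4) is installed, the same image/title pairing can be tried on an inline snippet without touching the network. The sample markup, the example.com URLs, and the function name extract_pairs are made up for illustration; only the div class names (image, data) and the a.title anchor come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the structure described in the question:
# a div.image followed by a sibling div.data that holds the a.title anchor.
SAMPLE_HTML = """
<div class="result">
  <div class="image"><img src="http://example.com/a.jpg"></div>
  <div class="data"><a class="title" href="#">Camera A (Red)</a></div>
</div>
<div class="result">
  <div class="image"><img src="http://example.com/b.jpg"></div>
  <div class="data"><span>no title here</span></div>
</div>
"""


def extract_pairs(html):
    """Return (image src, title text) tuples for each image div with a title."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for div in soup.find_all("div", class_="image"):
        img = div.find("img")
        # find_next_sibling returns one Tag or None, so no inner loop is needed
        data = div.find_next_sibling("div", class_="data")
        anchor = data.find("a", class_="title") if data is not None else None
        if img is None or anchor is None:
            # Skip entries that lack either the image or the title anchor
            continue
        pairs.append((img.get("src"), anchor.get_text(strip=True)))
    return pairs


if __name__ == "__main__":
    for src, title in extract_pairs(SAMPLE_HTML):
        print(src, title)
```

The sibling lookup returns a single tag rather than a list. Looping over that tag, as the question's code does, walks its children, which can include plain text nodes that do not support findAll; that is likely why the title lookup in the question failed.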