Tags: python, python-2.7, web-scraping, beautifulsoup, mechanize

Not getting proper links from Google search results using mechanize and BeautifulSoup


I am using the following snippet to get links from the Google search results for the keyword I give.

import mechanize
from bs4 import BeautifulSoup


def googlesearch():
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.addheaders = [('User-agent', 'Mozilla/5.0')]
    br.open('http://www.google.com/')

    # do the query
    br.select_form(name='f')
    br.form['q'] = 'scrapy'  # query
    data = br.submit()
    soup = BeautifulSoup(data.read(), 'html.parser')
    for a in soup.find_all('a', href=True):
        print "Found the URL:", a['href']

googlesearch()

Since I am parsing the search results HTML page to get the links, it picks up all the 'a' tags, but what I need are only the links for the actual results. Another thing: when you look at the output of the href attribute, it gives something like this

Found the URL: /search?q=scrapy&hl=en-IN&gbv=1&prmd=ivns&source=lnt&tbs=li:1&sa=X&ei=DT8HU9SlG8bskgWvqIHQAQ&ved=0CBgQpwUoAQ

But the actual link present in the href attribute is http://scrapy.org/

Can anyone point me to a solution for the two questions mentioned above?

Thanks in advance


Solution

  • Get only the links for the results

    The links you're interested in are inside h3 tags with the class r:

    <li class="g">
      <h3 class="r">
        <a href="/url?q=http://scrapy.org/&amp;sa=U&amp;ei=XdIUU8DOHo-ElAXuvIHQDQ&amp;ved=0CBwQFjAA&amp;usg=AFQjCNHVtUrLoWJ8XWAROG-a4G8npQWXfQ">
          <b>Scrapy</b> | An open source web scraping framework for Python
        </a>
      </h3>
      ..
    

    You can find the links using a CSS selector:

    soup.select('.r a')
    
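    As a quick sanity check, the selector can be run against the markup shown above (a minimal sketch using bs4's built-in html.parser; the print() form works in both Python 2 and 3):

    ```python
    from bs4 import BeautifulSoup

    # Sample result markup, as shown above (HTML entities like &amp;
    # are decoded by the parser)
    html = '''
    <li class="g">
      <h3 class="r">
        <a href="/url?q=http://scrapy.org/&amp;sa=U">
          <b>Scrapy</b> | An open source web scraping framework for Python
        </a>
      </h3>
    </li>
    '''

    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.select('.r a'):
        print(a['href'])  # -> /url?q=http://scrapy.org/&sa=U
    ```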

  • Get the actual link

    URLs are in the following format:

    /url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ
         ^^^^^^^^^^^^^^^^^^^^
    

    The actual URL is in the q parameter.

    To get the entire query string, use urlparse.urlparse:

    >>> url = '/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'
    >>> urlparse.urlparse(url).query
    'q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'
    

    Then, use urlparse.parse_qs to parse the query string and extract the q parameter value:

    >>> urlparse.parse_qs(urlparse.urlparse(url).query)['q']
    ['http://scrapy.org/']
    >>> urlparse.parse_qs(urlparse.urlparse(url).query)['q'][0]
    'http://scrapy.org/'
    
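    Note that the urlparse module is Python 2 only; if you ever move to Python 3, the same extraction looks like this (a sketch using urllib.parse, which is where urlparse's functions live in Python 3):

    ```python
    from urllib.parse import urlparse, parse_qs

    # Same redirect-style URL as in the Python 2 session above
    url = '/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'

    # parse_qs returns a dict mapping each parameter to a list of values
    print(parse_qs(urlparse(url).query)['q'][0])  # http://scrapy.org/
    ```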

    Final result

    import urlparse

    for a in soup.select('.r a'):
        print urlparse.parse_qs(urlparse.urlparse(a['href']).query)['q'][0]
    

    Output:

    http://scrapy.org/
    http://doc.scrapy.org/en/latest/intro/tutorial.html
    http://doc.scrapy.org/
    http://scrapy.org/download/
    http://doc.scrapy.org/en/latest/intro/overview.html
    http://scrapy.org/doc/
    http://scrapy.org/companies/
    https://github.com/scrapy/scrapy
    http://en.wikipedia.org/wiki/Scrapy
    http://www.youtube.com/watch?v=1EFnX1UkXVU
    https://pypi.python.org/pypi/Scrapy
    http://pypix.com/python/build-website-crawler-based-upon-scrapy/
    http://scrapinghub.com/scrapy-cloud
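
    Putting both steps together, a slightly more defensive version can skip any href that carries no q parameter, in case a non-result link ever slips through the selector. This is a sketch in Python 3 syntax, and extract_result_url is a hypothetical helper, not part of the original answer:

    ```python
    from urllib.parse import urlparse, parse_qs

    def extract_result_url(href):
        """Return the target URL from a Google redirect-style href,
        or None if the href has no q parameter (e.g. an internal
        navigation link such as /search?...)."""
        params = parse_qs(urlparse(href).query)
        values = params.get('q')
        return values[0] if values else None

    print(extract_result_url('/url?q=http://scrapy.org/&sa=U'))  # http://scrapy.org/
    print(extract_result_url('/search?hl=en&gbv=1'))             # None
    ```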